RADC-TR-88-69, Vol I (of two)
Final Technical Report

March 1988


AD-A200  204 


R/M/T  DESIGN  FOR  FAULT 
TOLERANCE,  PROGRAM 
MANAGER’S  GUIDE 


Grumman  Aerospace  Corporation 
David J. Conroe and Stanley J. Murn, Jr.


APPROVED FOR PUBLIC RELEASE; DISTRIBUTION UNLIMITED.


ROME  AIR  DEVELOPMENT  CENTER 
Air Force Systems Command
Griffiss Air Force Base, NY 13441-5700


88  8  19 


This report has been reviewed by the RADC Public Affairs Division (PA)
and is releasable to the National Technical Information Service (NTIS). At
NTIS it will be releasable to the general public, including foreign nations.

RADC-TR-83-69,  Volume  I  (of  two)  has  been  reviewed  and  is  approved  for 
publication. 


If your address has changed or if you wish to be removed from the RADC
mailing list, or if the addressee is no longer employed by your organization,
please notify RADC (RBET), Griffiss AFB NY 13441-5700. This will assist us
in maintaining a current mailing list.

Do not return copies of this report unless contractual obligations or notices
on a specific document require that it be returned.




UNCLASSIFIED 


REPORT DOCUMENTATION PAGE
Form Approved, OMB No. 0704-0191

1a. REPORT SECURITY CLASSIFICATION: UNCLASSIFIED
3. DISTRIBUTION/AVAILABILITY OF REPORT: Approved for public release;
distribution unlimited.
5. MONITORING ORGANIZATION REPORT NUMBER: RADC-TR-88-69, Vol I (of two)
6a. NAME OF PERFORMING ORGANIZATION: Grumman Aerospace Corporation
6c. ADDRESS (City, State, and ZIP Code): S. Oyster Bay Road, Bethpage NY 11714
7a. NAME OF MONITORING ORGANIZATION: Rome Air Development Center (RBET)
7b. ADDRESS (City, State, and ZIP Code): Griffiss AFB NY 13441-5700
8a. NAME OF FUNDING/SPONSORING ORGANIZATION: Rome Air Development Center
8c. ADDRESS (City, State, and ZIP Code): Griffiss AFB NY 13441-5700
9. PROCUREMENT INSTRUMENT IDENTIFICATION NUMBER: F30602-85-C-0161
10. SOURCE OF FUNDING NUMBERS: Program Element No. 62702F; Project No. 2338
11. TITLE (Include Security Classification): R/M/T DESIGN FOR FAULT
TOLERANCE, PROGRAM MANAGER'S GUIDE
12. PERSONAL AUTHOR(S): David J. Conroe, Stanley J. Murn, Jr.
13a. TYPE OF REPORT: Final
13b. TIME COVERED: from Oct 85 to May bS
14. DATE OF REPORT (Year, Month, Day): March 1988
15. PAGE COUNT: 166
16. SUPPLEMENTARY NOTATION: RADC-TR-88-69, Volume II (of two) will be
published at a later date.
17. COSATI CODES: Field 13, Group 08
18. SUBJECT TERMS (Continue on reverse if necessary and identify by block
number): Reliability, Maintainability, Testability, Fault Tolerance,
Program Management, Design Guidance

19. ABSTRACT (Continue on reverse if necessary and identify by block number)

Fault tolerance has come into almost universal use in modern day systems
of all types. This report contains design guidance and general information
for Air Force and contractor program managers with respect to the nature
and form of Reliability/Maintainability/Testability (R/M/T) tasks needed
in the development of fault tolerant systems. This program manager's
guide contains instructions for tailoring the R/M/T programmatic standards
(MIL-STDs 785, 470 & 2165) for fault tolerant systems development.
Important fault tolerance design options and tradeoff analysis methods
are discussed to aid the program manager in understanding and overseeing
the entire fault tolerant system design process. This report is Volume I
of II. Volume II will be an R/M/T Fault Tolerant Design Implementation
Guide available at a later date and will contain a more in-depth view of
the technical issues regarding fault tolerance design techniques.

20. DISTRIBUTION/AVAILABILITY OF ABSTRACT: Unclassified/Unlimited
21. ABSTRACT SECURITY CLASSIFICATION: UNCLASSIFIED
22a. NAME OF RESPONSIBLE INDIVIDUAL: Joseph A. Caroll

DD Form 1473, JUN 86. Previous editions are obsolete.
SECURITY CLASSIFICATION OF THIS PAGE: UNCLASSIFIED


EXECUTIVE  SUMMARY 


This Program Manager's Guide was prepared by the Aircraft Systems
Division of the Grumman Corporation under RADC contract
F30602-85-C-0161, entitled Reliability/Maintainability/Testability
(R/M/T) Design for Fault Tolerance. The objectives of this document are
to provide Air Force and contractor program managers with guidance on
how to address fault tolerant design issues and needs, and to provide
general information on state-of-the-art R/M/T fault tolerance
techniques. An R/M/T Fault Tolerant Design Implementation Guide, which
will contain a more in-depth technical treatise on fault tolerance
techniques and analysis methodologies for use by the Air Force and
contractor technical personnel, is also being prepared under this
contract and will be available by the end of 1988. These Guides are
being developed to structure cost effective programs for reliable,
maintainable and testable fault tolerant C3I (Command, Control,
Communications and Intelligence) systems.

When properly applied, fault tolerance can significantly and
effectively enhance the mission capabilities of C3I systems. It is
imperative, however, that program managers understand the configuration
selection process to avoid the infusion of unnecessary system
complexities that contribute little to mission capability and increase
life-cycle cost. The system performance, supportability and cost of
competing fault tolerance approaches must be clearly defined early in
the development phase to support critical management configuration
decisions.

This Program Manager's Guide provides the essential background
information needed by Air Force and contractor program managers to
understand the specification, design and tradeoff analyses required for
fault tolerant C3I system developments. It is organized in a manner that
follows the configuration development process and addresses the following
critical areas:

o R/M/T program planning and management
o Specification of fault tolerance and R/M/T requirements
o Relationship of fault tolerance to mission and safety criticality
o Guidance for design of fault tolerance
o Evaluation of design cost effectiveness.

This Guide can be used either as a tutorial aid or as a management
reference document. Numerous fault tolerance examples are presented
which illustrate the potential benefits that can be derived and the
areas of application. Graphics and emphasized type fonts are used
extensively in this Guide to summarize the material presented and to
highlight important management issues. In addition, checklists are
located at the end of each section. These checklists provide a handy
reference of major pertinent R/M/T impact areas that program managers
should address in future fault tolerant C3I development programs.


PREFACE 


This Program Manager's Guide was prepared by the Grumman Aircraft
Systems Division Reliability, Maintainability and Safety Section,
Bethpage, New York, for Rome Air Development Center, Griffiss Air Force
Base, New York. Mrs. Heather Dussault and Mr. Joseph Caroll (RBET)
were the RADC Project Engineers.

This Guide was developed during the period from September 1985
through March 1987. In addition to the authors, David Conroe and
Stanley Murn, Jr., other Grumman study team contributors were Messrs.
Gary Bigel, Allan Dantowitz, John DiLeo, Theodore Jordan, Kenneth
Hallor, John Kappler, Robert Messina, Victor Pellirione, George Pflugel
and Edward Ramirez.




CONTENTS

Section                                                             Page

1  INTRODUCTION .................................................... 1-1

2  R/M/T PROGRAM PLANNING AND MANAGEMENT ........................... 2-1
   2.1  System Tailored Approach ................................... 2-1
        2.1.1  Reliability Program Tailoring ....................... 2-4
        2.1.2  Maintainability Program Tailoring ................... 2-12
        2.1.3  Testability/Diagnostic Program Tailoring ............ 2-20
        2.1.4  Software Program Tailoring .......................... 2-31
   2.2  Program Planning and Management Checklist Questions ........ 2-36
   2.3  Specification of Fault Tolerance and R/M/T Requirements .... 2-37
        2.3.1  System Quantitative R/M/T Requirements .............. 2-38
        2.3.2  Verification ........................................ 2-50
        2.3.3  Warranties .......................................... 2-53
   2.4  Specification Checklist Questions .......................... 2-55

3  RELATIONSHIP OF C3I FAULT TOLERANCE TO MISSION AND SAFETY
   CRITICALITY ..................................................... 3-1
   3.1  Formulation of C3I Fault Tolerance Requirements ............ 3-1
   3.2  Examples of Typical C3I Fault Tolerance Applications ....... 3-6
        3.2.1  Space Surveillance System ........................... 3-6
        3.2.2  Airborne Surveillance Radar System .................. 3-12
   3.3  Fault Tolerance Requirements Checklist ..................... 3-16

4  GUIDANCE FOR DESIGN OF FAULT TOLERANCE .......................... 4-1
   4.1  Hardware and Software Fault Tolerance Design Options ....... 4-1
        4.1.1  Redundancy Techniques ............................... 4-4
        4.1.2  Active Redundancy ................................... 4-8
        4.1.3  Standby Redundancy .................................. 4-14
        4.1.4  Voting Redundancy ................................... 4-15
        4.1.5  Hybrid Redundancy ................................... 4-16
        4.1.6  K of N Configurations ............................... 4-17
        4.1.7  Graceful Degradation ................................ 4-18
        4.1.8  Fault Detection Techniques .......................... 4-20
        4.1.9  Error Detection Codes ............................... 4-21
        4.1.10 Distributed Processing .............................. 4-21
        4.1.11 Hardware and Software Fault Tolerant Design
               Checklist Questions ................................. 4-24
   4.2  Maintainability/Testability Impact on Fault Tolerant
        Design Options ............................................. 4-25
        4.2.1  Testability of Fault Tolerant Designs ............... 4-25
        4.2.2  Maintainability of Fault Tolerant Designs ........... 4-31
        4.2.3  Maintainability and Testability Checklist Questions . 4-34

5  FAULT TOLERANCE DESIGN AND TRADEOFF ANALYSES .................... 5-1
   5.1  Fault Tolerance Design Methodology ......................... 5-1
        5.1.1  Baseline Design ..................................... 5-1
        5.1.2  Fault Avoidance Techniques .......................... 5-5
        5.1.3  Development of the Fault Tolerant Design Approach ... 5-5
        5.1.4  Fault Detection Implementation ...................... 5-6
        5.1.5  Recovery Implementation ............................. 5-6
        5.1.6  R/M/T Evaluation Techniques ......................... 5-7
        5.1.7  Fault Tolerant Design Methodology Checklist
               Questions ........................................... 5-8
   5.2  R/M/T Design Tradeoff Analyses ............................. 5-9
        5.2.1  Readiness Analysis .................................. 5-9
        5.2.2  Logistics Resource Analysis ......................... 5-13
        5.2.3  Mission Effectiveness Analysis ...................... 5-14
        5.2.4  Life-Cycle Cost (LCC) Analysis ...................... 5-14
        5.2.5  R/M/T Design Tradeoff Analysis Checklist Questions .. 5-16

6  ACRONYMS ........................................................ 6-1

Appendix                                                            Page

A  GLOSSARY OF RELIABILITY, MAINTAINABILITY, TESTABILITY AND
   FAULT TOLERANCE TERMS ........................................... A-1

B  REFERENCES & LIST OF GOVERNMENT DOCUMENTS


LIST OF ILLUSTRATIONS

Figure                                                              Page

1-1  Manager's Guide Overview of C3I Fault Tolerant Design Process . 1-3
2-1  System Testability Program Flow Diagram ....................... 2-21
2-2  Verification of System Performance Characteristics ............ 2-43
2-3  Examples of Reliability Specification of Fault Tolerant
     Systems ....................................................... 2-45
2-4  Model Requirements for Testability in a System Specification .. 2-49
3-1  Identification of Mission & Safety Critical Fault Tolerance
     Requirements .................................................. 3-3
3-2  Space Surveillance System Fault Tolerance ..................... 3-9
3-3  C3I Airborne Surveillance Radar Operational Concept ........... 3-13
4-1  Elements of Fault Tolerance ................................... 4-2
4-2  Fault Tolerance Design Options ................................ 4-9
4-3  Graceful Degradation of Antenna Receive/Transmit (R/T)
     Modules ....................................................... 4-19
5-1  Fault Tolerance Design Methodology ............................ 5-3
5-2  Factors & Relationships Affecting Readiness & System
     Effectiveness ................................................. 5-10
5-3  Relationship of Availability and Its Drivers MTTR & MTBM ...... 5-12
5-4  Relationships of A0, Reliability, & MRT ....................... 5-13


LIST OF TABLES

Table                                                               Page

2-1  MIL-STD-785 Reliability Task Applications Guidance Matrix
     for Fault Tolerant Systems .................................... 2-6
2-2  MIL-STD-470 Maintainability Task Applications Guidance Matrix
     for Fault Tolerant Systems .................................... 2-13
2-3  MIL-STD-2165 Testability Task Applications Guidance Matrix
     for Fault Tolerant Systems .................................... 2-25
2-4  Typical Distribution of Software Development Schedule &
     Personnel Effort by Phase ..................................... 2-33
2-5  DoD-STD-2167 Software Documentation Requirements Matrix ....... 2-34
2-6  Notational Diagnostic Performance Specification ............... 2-51
3-1  Typical Functional Criticality Prioritization ................. 3-6
4-1  Detection Technique Characteristics ........................... 4-13
4-2  Properties of Error Detection Codes ........................... 4-13
4-3  Maintenance Concept Options ................................... 4-35
5-1  Characteristics of Current Reliability Models ................. 5-8


1  -  INTRODUCTION 


Reliability, maintainability, and testability (R/M/T) are essential
system attributes required to achieve the Command, Control,
Communications and Intelligence program objectives of high system
effectiveness and minimum life-cycle cost (LCC). A fault tolerant
system design is one that has provisions to avoid failure after faults
have caused errors within the system. Therefore, fault tolerant design
approaches can significantly increase C3I system reliability, and are
often required to meet stringent reliability requirements, assure the
availability of critical C3I mission functions, and avoid potential
safety hazards.

The objective of this study is to provide R/M/T design guidance for
fault tolerant C3I systems. The design guidance addresses program
management functions and information sources, and has been tailored for
use by Air Force (AF) system planners and AF contractors. The
guidelines developed as part of this study should be used to develop
cost-effective requirement planning and design development programs for
reliable, maintainable, and testable fault tolerant systems. Air Force
and contractor program managers must provide the leadership to control
the fault tolerant design process to assure that system effectiveness
and LCC are not compromised. Figure 1-1 provides an overview of this
Program Manager's Guide, and depicts the design process for establishing
fault tolerant configurations from the program system requirements step
through tradeoff analyses of alternate design approaches. As illustrated
in this figure, the Guide is organized chronologically by each step of
the fault tolerant design process.

Section 2 of the Guide contains an approach for planning, managing,
and tailoring R/M/T programs for C3I fault tolerant system development.
Tailoring is the process by which individual requirements are evaluated
to determine the extent to which they are suited for a particular
system development and acquisition. The approach recommended in
Section 2 evolved from an extensive review of applicable military
standards governing the conduct of R/M/T and Safety programs for
systems and equipment. This section also contains R/M/T program task
application matrices, flow diagrams, and guidelines for the
specification of fault tolerance and R/M/T requirements.

Section 3 describes the relationship between C3I program requirements
and mission and safety criticality. Fault tolerance should be
incorporated into a design as part of the system engineering process,
since experience has shown that a hierarchical approach involving the
selective application of fault tolerant design techniques is most
effective. In general, the fault tolerant design methodology used by
system engineering personnel consists of first creating a baseline
design and then systematically introducing the appropriate levels of
fault tolerance required to meet R/M/T requirements. A key ingredient in
fault tolerant design is the application of hardware redundancy. Since
added hardware increases maintenance, weight, volume, complexity, cost,
and spares, it is important that fault tolerant design techniques are
not used indiscriminately.
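
As a rough illustration of why redundancy should be applied
selectively, the following Python sketch compares mission reliability
against installed hardware for a single unit, an active redundant pair,
and a 2-of-3 voting arrangement. It is only a sketch under stated
assumptions: identical, independent units with a constant failure rate,
and perfect fault detection and switching. The failure rate, mission
time, configuration labels, and function names are hypothetical and are
not taken from this Guide.

```python
import math


def unit_reliability(failure_rate_per_hour, mission_hours):
    """Reliability of one unit, assuming a constant failure rate."""
    return math.exp(-failure_rate_per_hour * mission_hours)


def k_of_n_reliability(k, n, r):
    """Probability that at least k of n identical, independent units survive."""
    return sum(
        math.comb(n, i) * r**i * (1.0 - r) ** (n - i) for i in range(k, n + 1)
    )


# Hypothetical values, chosen only for illustration.
FAILURE_RATE = 1.0e-4  # failures per hour for one unit
MISSION_HOURS = 1000.0

# (units required, units installed) for each candidate configuration.
OPTIONS = {
    "single unit": (1, 1),
    "active pair, 1 of 2": (1, 2),
    "voting, 2 of 3": (2, 3),
}


def compare_options():
    """Return (name, mission reliability, relative maintenance demand) rows."""
    r = unit_reliability(FAILURE_RATE, MISSION_HOURS)
    rows = []
    for name, (k, n) in OPTIONS.items():
        # Every installed unit can fail, so maintenance demand scales with n
        # even though only k units are needed to complete the mission.
        rows.append((name, k_of_n_reliability(k, n, r), n))
    return rows


if __name__ == "__main__":
    for name, reliability, demand in compare_options():
        print(f"{name:22s} mission reliability {reliability:.4f}, "
              f"maintenance demand x{demand}")
```

Under these assumptions both redundant arrangements improve mission
reliability over the single unit, but they double or triple the number
of units that can fail and require repair, which is the tradeoff the
paragraph above warns about.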

Section 4 delineates the R/M/T attributes of the various fault
tolerant design options along with typical application areas. Section 5
contains a description of fault tolerant design methodology and
presents the methodology used to evaluate the cost effectiveness of
alternative fault tolerant design options. Appendix A contains a
glossary of R/M/T and fault tolerance terms. Sources of information
used in this study included DoD directives, NASA, DoD and military
standards, military handbooks, open literature, and RADC technical
reports on R/M/T for fault tolerance. Appendix B contains a list of
references including the identification of the exact issue of military
and DoD standards and NASA documents referenced in the Guide.

Within this Guide, attempts were made to identify the individual
responsibilities of both AF and contractor program managers in the fault
tolerance design process. The AF program managers are responsible for
the establishment of system program requirements and the approval of
design configurations. The prime and systems integration contractors
are responsible for the development and optimization of design
configurations that satisfy the system requirements. To assure a
cost-effective program, both the AF and the contractor must work
together to formulate realistic system requirements and conduct design
tradeoff analyses. Therefore, all the material presented herein should
be of interest to both AF and contractor program managers.

Figure 1-1. Manager's Guide Overview of C3I Fault Tolerant Design Process
(flow diagram; blocks: C3I program requirements; R/M/T program planning
& management (Section 2.0); R/M/T requirements & constraints: mission
success probability, logistics, weight/volume/power, etc.; identify
critical mission and safety functions: mission criticality analysis,
safety assessment; develop alternate fault tolerant designs: redundancy,
graceful degradation, reconfiguration strategies, etc.; conduct design
trade-off analyses: R/M impact, readiness, LCC; select fault tolerant
configuration)

Program  managers  should  address  the  checklist  questions  provided 
at  the  end  of  each  section.  Unless  specifically  noted,  the  checklist 
questions  apply  both  to  AF  and  contractor  program  managers.  These 
questions  are  particularly  applicable  at  the  System  Requirements  Review 
(SRR),  Preliminary  Design  Review  (PDR),  and  Critical  Design  Review 
(CDR) to supplement the R&M evaluation criteria listed in MIL-STD-1521,
Technical  Reviews  and  Audits  for  Systems,  Equipment  and  Computer 
Software.  Questions  primarily  addressed  to  the  Procuring  Activity  are 
followed  by  a  (PA).  Those  addressed  to  integrating  or  prime 
contractors  are  followed  by  a  (C). 




2  -  R/M/T  PROGRAM  PLANNING  AND  MANAGEMENT 

This section contains an approach to tailoring R/M/T programs for
C3I fault tolerant systems development. Task application matrices, flow
diagrams and areas of special emphasis are provided to assist AF
program managers in R/M/T program planning and management. Management
guidelines are provided for the specification of fault tolerance and
R/M/T requirements, for software program management and for
Reliability and Maintainability (R&M) warranties.

2.1  SYSTEM  TAILORED  APPROACH 

The R/M/T tasks and associated application matrices (delineated in
MIL-STD-785B, MIL-STD-470A and MIL-STD-2165, respectively) are
applicable to the development programs of fault tolerant systems. In
general, these military standards adequately describe the R/M/T tasks
recommended for implementation when developing fault tolerant systems.

However, there are some task guidelines and tailoring that an AF
program manager should consider when developing a Statement of Work
(SOW) for a fault tolerant system. These guidelines are described in
paras. 2.1.1, 2.1.2, and 2.1.3 of this Guide along with descriptions of
other tasks that are important in the formulation of overall R/M/T
programs.

R/M/T task tailoring depends upon the performance requirement
levels that must be achieved and the expected extent of new design and
development involved. For example, a new strategic C3I system would
require a more extensive application of R/M/T tasks than would an
evolutionary C3I system design approach which utilizes existing and
qualified equipment/subsystems. All procurements require analysis to
specify R/M/T levels. If mission criticality requirements are found to
be low, it may be possible to reduce acquisition costs by procuring
commercial off-the-shelf (COTS) equipment. In general, the procurement
of COTS equipment requires effort to select items with "as is"
suitability and demonstrated acceptability to meet program needs.
(Refer to MIL-HDBK-338, para. 12.7.) Hence, the emphasis in procurement
of COTS equipment is on selection, not specification. For these
reasons, a reduced set of R/M/T tasks may be appropriate and
cost-effective for fault tolerant C3I system programs that incorporate
extensive use of COTS equipment.

An  Air  Force  program  manager  should  consider  the  following  subset 
of  R/M/T  tasks  when  developing  Statements  of  Work: 

• Program plans - Since the program plan identifies and ties together
all program management tasks deemed necessary to support the
economical achievement of overall R/M/T program objectives, the plan
is a necessary ingredient in any system development/acquisition

• Allocation of specification requirements - The allocation process
is necessary since it transforms overall system R/M/T requirements
into manageable lower level requirements for subsystems and
equipments

• Design criteria - Provide standards for design compliance and
help shape fault tolerant system architectures with the minimum
of added redundancy and complexity

• Trade studies - Tradeoffs between alternate fault tolerant
configurations which are capable of meeting system R/M/T requirements
are mandatory to assure that the most cost-effective design approach
is utilized

• Subcontractor and supplier control - The primary contractor's
understanding and control of a subcontractor's R/M/T program is
fundamental to meeting overall program goals




• Thermal design analysis - Reduction in the operating temperature
of components is a primary method of improving reliability, and
is often as important as circuit design in obtaining the necessary
performance characteristics from electronic equipment

• Predictions (including Built-In Test (BIT)/preventive
maintenance/diagnostic capability) - Predictions combine lower level
R/M/T data to indicate equipment parameters at successively
higher levels from subassemblies through subsystems to the
system. Predictions that fall short of requirements at any level
may signal the need for management and technical action

• Effects of functional testing, storage, handling, packaging,
transportation, and maintenance - The results of analyses in
these areas are needed to support long-term failure rate
predictions, design tradeoffs, definition of allowable test exposures,
packaging, handling and storage requirements, and refurbishment
plans

• Test/verification planning - R/M/T test and verification
procedures are required to: (1) disclose deficiencies in the system
design, material, and workmanship; (2) provide R/M/T data for
estimates of operational readiness, mission success, maintenance
manpower, and logistics support costs; and (3) determine compliance
with quantitative R/M/T requirements

• Environmental Stress Screening (ESS) - ESS procedures are
required so that failures due to weak parts, workmanship defects,
and other non-conformance anomalies can be identified and removed
from the equipment, or so appropriate redesign measures may be taken

• Failure Reporting, Analysis and Corrective Action System
(FRACAS) - A well organized system for the collection,
analysis/review, dissemination, and close-out of failure reports
is essential to the workings of an R&M program, and can provide
management visibility into problem areas




• Participation in design reviews (PDR, CDR, etc.) - Review of
R/M/T program status at specified points is necessary to assure
that the program is proceeding in accordance with contractual
milestones and that system R/M/T requirements will be achieved

• Operational assessment - Operational systems should be
continually assessed to assure that they are performing in
accordance with predictions and to identify areas where
improvements can be incorporated to minimize degradation,
improve R/M/T, and reduce the LCC

• Testability program and requirements - To assure development of
the fault detection and fault isolation capability that is necessary
to support system reconfiguration, maintenance diagnostics, and
achievement of overall program R&M requirements, a testability
program should be conducted as part of any fault tolerant
system development/acquisition

• Built-In Test analysis - Analysis of the BIT features and BIT
equipment designs that will be used to detect and isolate faults
and to support redundancy management and system reconfiguration
is necessary to assure that the desired fault tolerance performance
levels are achieved.
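
The allocation and prediction tasks in the list above can be
illustrated with a small Python sketch. It assumes a simple series
(non-redundant) model in which constant failure rates add, so the
predicted failure rate at each level is the sum of the rates below it;
redundant configurations need a more detailed reliability model. The
item names, failure rates, and allocations are hypothetical and serve
only to show how a prediction that falls short of its allocation can be
flagged at any level.

```python
def predicted_rate(node):
    """Predicted failure rate of a node: its own rate, or the sum of its children."""
    children = node.get("children", [])
    if not children:
        return node["failure_rate"]
    return sum(predicted_rate(child) for child in children)


def find_shortfalls(node, prefix=""):
    """List (path, predicted, allocated) wherever a prediction exceeds its allocation."""
    path = prefix + node["name"]
    shortfalls = []
    allocated = node.get("allocated")
    predicted = predicted_rate(node)
    if allocated is not None and predicted > allocated:
        shortfalls.append((path, predicted, allocated))
    for child in node.get("children", []):
        shortfalls.extend(find_shortfalls(child, path + "/"))
    return shortfalls


# Hypothetical breakdown; rates are failures per hour.
SYSTEM = {
    "name": "system",
    "allocated": 500e-6,
    "children": [
        {
            "name": "processor",
            "allocated": 200e-6,
            "children": [
                {"name": "cpu_card", "failure_rate": 120e-6},
                {"name": "memory_card", "failure_rate": 100e-6},
            ],
        },
        {"name": "receiver", "allocated": 250e-6, "failure_rate": 150e-6},
    ],
}

if __name__ == "__main__":
    for path, predicted, allocated in find_shortfalls(SYSTEM):
        print(f"{path}: predicted {predicted:.2e} exceeds allocated {allocated:.2e}")
```

In this hypothetical case the system-level prediction meets its
allocation while one subsystem does not, which is the kind of
lower-level shortfall that may call for management and technical
action.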

In addition to the task application matrices contained in
MIL-STD-785, -470 and -2165, R&M system tailoring guidance (based on
R&M requirement levels and design maturity) is also contained in
MIL-HDBK-338, Electronic Reliability Design Handbook. For these
military standards, additional specific application guidelines to fault
tolerant system development efforts are provided in the following
subsections.

2.1.1  Reliability  Program  Tailoring 

Before selecting, tailoring and integrating reliability tasks for a C3I
development program, the AF program manager should refer to the
pertinent application guidance contained in Appendix A of MIL-STD-785.
Some reliability tasks applicable to fault tolerant system developments
require additional emphasis with regard to advancing their
implementation schedule and require a higher level of effort. The
MIL-STD-785 task application matrix, modified for fault tolerant system
developments, is shown in Table 2-1. Reliability tasks that require
additional emphasis and tailoring in both the Statement of Work (SOW)
and associated CDRLs for fault tolerant system developments are
discussed in the paragraphs that follow. In addition, a number of other
reliability tasks, which are implemented in the same way for both fault
tolerant and non-fault tolerant systems, have been included due to
their importance in the formulation of the overall reliability program.

• Task 101, Reliability Program Plan - The procedures and content
for this task are the same for fault tolerant and non-fault
tolerant systems. However, a write-up on the task description has
been provided since this task is deemed to be "generally
applicable" during all program phases for fault tolerant system
development. The plan provides management visibility for proper
monitoring, control and coordination between interrelated design and
support activities. The Full-Scale Engineering Development
phase SOW should require the contractor to develop specific fault
tolerance questions for inclusion in the design review checklist.
It is also recommended that this SOW include a requirement for
the development of a Fault Tolerance Test Plan which details
plans for evaluating and demonstrating how well the design meets
fault tolerance requirements, especially with regard to fault
protection coverage and fault recovery times.

• Task 104, Failure Reporting, Analysis and Corrective Action
System (FRACAS) - The stringent reliability requirements attendant
to fault tolerant systems require the early elimination of
failure causes since this process is a major contributor to
reliability growth and attainment of acceptable field reliability.
During Full-Scale Engineering Development, the FRACAS should
be required to document hardware anomalies, software errors and
masked faults. This enhances the ability to develop corrective
action and monitor the reliability growth of the system. The
procedure for implementing the FRACAS on fault tolerant systems
is the same as that used on non-fault tolerant systems.

TABLE 2-1. MIL-STD-785 Reliability Task Applications Guidance Matrix
for Fault Tolerant Systems.

                                                         PROGRAM PHASE
TASK  TITLE                              TASK   CONCEPT  VALID  FSED     PROD
                                         TYPE
101   Reliability Program Plan           MGT    ■        ■      G        G
102   Monitor/Control of Subcontractors  MGT    S        S      G        G
      & Suppliers
103   Program Reviews                    MGT    S        S(2)   G(2)     G(2)
104   Failure Reporting, Analysis &      ENGRG  NA       S      G        G
      Corrective Action System (FRACAS)
105   Failure Review Board (FRB)         MGT    NA       S(2)   G        G
201   Reliability Modeling               ENGRG  ■        ■      G(2)     GC(2)
202   Reliability Allocations            ACCT   ■        G      G        GC
203   Reliability Predictions            ACCT   ■        ■      G(2)     GC(2)
204   Failure Modes, Effects, &          ENGRG  ■        ■      G(1)(2)  GC(1)(2)
      Criticality Analysis (FMECA)
205   Sneak Circuit Analysis (SCA)       ENGRG  NA       NA     G(1)     GC(1)
206   Electronic Parts/Circuits          ENGRG  NA       NA     G        GC
      Tolerance Analysis
207   Parts Program                      ENGRG  S        S(2)   G(2)     G(2)
208   Reliability Critical Items         MGT    S(1)     ■      G        G
209   Effects of Functional Testing,     ENGRG  NA       S(1)   G        GC
      Storage, Handling, Packaging,
      Transportation & Maintenance
301   Environmental Stress Screening     ENGRG  NA       S      G        G
      (ESS)
302   Reliability Development/Growth     ENGRG  NA       S(2)   G(2)     NA
      Testing
303   Reliability Qualification Test     ACCT   NA       S(2)   G(2)     G(2)
      (RQT) Program
304   Production Reliability Acceptance  ACCT   NA       NA     S        G(2)
      Test (PRAT) Program

NOTE: PROGRAM PHASE APPLICABILITY CHANGES FROM TABLE A-1 OF
MIL-STD-785B ARE SHOWN WITH SHADING (■).

CODE DEFINITIONS

TASK TYPE
ACCT  - RELIABILITY ACCOUNTING
ENGRG - RELIABILITY ENGINEERING
MGT   - MANAGEMENT

PROGRAM PHASE
S   - SELECTIVELY APPLICABLE
G   - GENERALLY APPLICABLE
GC  - GENERALLY APPLICABLE TO DESIGN CHANGES ONLY
NA  - NOT APPLICABLE
(1) - REQUIRES CONSIDERABLE INTERPRETATION OF INTENT TO BE COST
      EFFECTIVE
(2) - MIL-STD-785 IS NOT THE PRIMARY IMPLEMENTATION REQUIREMENT. OTHER
      MIL-STDS OR STATEMENT OF WORK REQUIREMENTS MUST BE INCLUDED TO
      DEFINE THE REQUIREMENTS.

•  Task  201,  Reliability  Modeling  -  The  development  of  reliability
models  is  mandatory  for  fault  tolerant  system  development  since
the  evaluation  of  these  models  is  an  integral  part  of  trade  study
activity  aimed  at  developing  and  selecting  the  lowest  LCC  configuration
capable  of  meeting  R/M/T  requirements.  This  task  is
"generally  applicable"  to  all  program  phases  since  careful  review
of  even  the  early  models  can  reveal  states  or  conditions  where
management  action  may  be  required.  The  mission  success
probability  model  should  be  developed  to  the  extent  that  information
becomes  available  concerning  the  fault  protection/redundancy
configuration(s),  even  though  numerical  input  data  may  not  be
available.  Single  point  failure  states,  which  can  cause  premature
mission  loss  or  unacceptable  safety  hazards,  can  then  be  readily
identified  and  targeted  for  additional  design  consideration.  The
methodology  and  procedures  used  for  fault  tolerant  systems
reliability  modeling  differ  from  those  of  other  systems  in  that
analysis  of  fault  tolerant  systems  generally  deals  with  the  much  more
complex  models  required  to  evaluate  reconfigurable  and  resource
sharing  configurations.

Mission  reliability  models  for  evaluating  conventional  series- 
parallel  equipment  configurations  should  be  based  on  the  tech¬ 
niques  described  in  Methods  1001  thru  1004  of  MIL-STD-756. 
Mission  reliability  modeling  of  systems  employing  extensive  hard¬ 
ware  redundancy  and  complex  fault  management,  recovery  and
reconfiguration  techniques  often  requires  sophisticated  evaluation 
tools  that  are  typically  based  on  Markov  analysis  techniques  and 
Monte  Carlo  simulation  methods.  In  addition,  these  tools  and 
models  typically  consider  the  following  situations: 

-  Redundancies  present 

-  Permanent  faults 

-  Transient  or  intermittent  faults

-  Effects  of  failure  modes

-  Propagating  sequences  of  faults

-  Mission  load  changes

-  System  response  to  failure  if  fault  protection  coverage 
(consisting  of  detection,  isolation,  recovery  or  reconfig¬ 
uration)  is  less  than  perfect. 

Some  currently  existing  computerized  models  used  for  the  reli¬ 
ability  assessment  of  these  complex  fault  tolerant  systems  are 
ARIES,  CARE  III,  HARP,  and  SURE  (see  para.  5.1.6). 
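
As a minimal illustration of the Markov approach referred to above, and not a substitute for tools such as ARIES, CARE III, HARP or SURE, the following Python sketch integrates the state equations of a hypothetical two-unit, non-repairable redundant system with less than perfect fault protection coverage. The failure rate, coverage values, mission time and all function names are assumptions chosen for illustration; none are taken from this report.

```python
import math


def state_derivatives(p, lam, cov):
    """Forward (Kolmogorov) equations for a non-repairable two-unit system.

    p = (both units up, one unit up, system failed).
    lam is the per-unit failure rate; cov is the probability that a first
    failure is detected, isolated and recovered from (coverage).
    """
    p2, p1, _ = p
    d2 = -2.0 * lam * p2                          # either unit fails
    d1 = 2.0 * lam * cov * p2 - lam * p1          # covered first failure
    df = 2.0 * lam * (1.0 - cov) * p2 + lam * p1  # uncovered or second failure
    return (d2, d1, df)


def mission_reliability(lam, cov, t_mission, steps=1000):
    """Integrate the state equations with classical RK4; return P(system up)."""
    h = t_mission / steps
    p = (1.0, 0.0, 0.0)

    def shifted(state, slope, factor):
        return tuple(s + factor * k for s, k in zip(state, slope))

    for _ in range(steps):
        k1 = state_derivatives(p, lam, cov)
        k2 = state_derivatives(shifted(p, k1, h / 2.0), lam, cov)
        k3 = state_derivatives(shifted(p, k2, h / 2.0), lam, cov)
        k4 = state_derivatives(shifted(p, k3, h), lam, cov)
        p = tuple(
            s + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
            for s, a, b, c, d in zip(p, k1, k2, k3, k4)
        )
    return p[0] + p[1]


def closed_form_reliability(lam, cov, t):
    """Analytic solution of the same three-state model, used as a cross-check."""
    e1, e2 = math.exp(-lam * t), math.exp(-2.0 * lam * t)
    return e2 + 2.0 * cov * (e1 - e2)


if __name__ == "__main__":
    # Hypothetical inputs: 1.0e-3 failures/hour per unit, 100 hour mission.
    for cov in (1.0, 0.99, 0.90):
        print(cov, mission_reliability(1.0e-3, cov, 100.0))
```

Even in this small model, mission reliability falls as coverage is reduced, which is why the coverage situation in the list above is normally modeled explicitly rather than assumed perfect.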

•  Task  202,  Reliability  Allocations  -  For  fault  tolerant  system
developments,  this  task  should  be  started  during  the  Concept
Exploration  phase  in  conjunction  with  the  establishment  of  system
level  requirements.  Early  management  visibility  of  subsystem 
allocations  may  highlight  the  reasonableness  of  these  system  level 
requirements  and,  if  warranted,  cause  their  reassessment.  In 
later  program  phases,  if  some  of  the  subsystem  and  lower  level 
allocations  appear  to  be  unreasonably  difficult  to  achieve,  then 
the  analysis  becomes  the  basis  for  performing  fault  tolerant  de¬ 
sign  and  redundancy  tradeoffs  among  the  subsystems.  The  sub¬ 
sequent  reallocation  should  provide  lower  equipment  level  re¬ 
liability  requirements/specifications  which  can  reasonably  be 
achieved.  Both  SOW  and  CDRL  requirements  for  reliability  al¬ 
locations  and  predictions  also  require  the  performance  of  task 
201  for  consistency  and  traceability. 
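
The allocation step can be sketched as follows. This report does not prescribe an allocation method; the weighted apportionment below, for subsystems that are all required for mission success (a series model), is an assumption chosen for illustration, as are the subsystem names, weights and the function name.

```python
import math


def allocate_reliability(system_requirement, weights):
    """Apportion a series-system reliability requirement to subsystems.

    Each subsystem i receives R_i = R_sys ** (w_i / sum(w)), so that the
    product of the allocations equals the system requirement. A larger
    weight (e.g. a more complex subsystem) receives a looser allocation.
    """
    if not 0.0 < system_requirement <= 1.0:
        raise ValueError("system requirement must be in (0, 1]")
    total = float(sum(weights.values()))
    return {name: system_requirement ** (w / total) for name, w in weights.items()}


if __name__ == "__main__":
    # Hypothetical subsystems and relative complexity weights.
    allocation = allocate_reliability(0.95, {"processor": 3, "sensor": 2, "display": 1})
    for name, r in allocation.items():
        print(f"{name}: {r:.4f}")
    print("product:", math.prod(allocation.values()))
```

If an allocation produced this way appears unreasonably difficult to achieve, the weights or the redundancy configuration would be revisited, which corresponds to the reallocation described above.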




•  Task  203,  Reliability  Predictions  -  This  task  is  deemed  to  be
"generally  applicable"  during  all  phases  of  fault  tolerant  system
development  since,  when  these  predictions  are  coupled  with  the
models  of  task  201,  the  early  mission  completion  success  probability
predictions  will  identify  those  subsystems  that  contribute
a  high  percentage  to  the  total  probability  of  mission  failure.
This  process  will  identify  those  areas  requiring  increased  fault 
tolerance  and  where  management  action  may  be  directed  to  yield 
the  highest  payoff.  Early  review  of  reliability  predictions  at  the 
lowest  equipment  levels  will  identify  parts  or  components  which 
may  have  inadequate  margins  between  the  parts  strength  and  the 
expected  applied  stress.  In  addition,  the  earlier  the  review  is 
performed,  the  greater  the  range  of  acceptable  options  for  im¬ 
proving  equipment  reliability.  Whenever  predictions  fall  short  of 
allocated  reliability  requirements,  alternatives  such  as  the  follow¬ 
ing  should  be  considered: 

-  Identify  suitable  higher  reliability  substitutes

-  Reapportion  reliability  allocations

-  Redesign  using  higher  reliability  parts  or  more  fault  tolerant
designs

-  Decrease  the  severity  of  environments  or  other  operational
stress  factors.

Some  alternatives  are  more  feasible  and  acceptable  than  others  at 
given  points  in  development,  but  all  are  easier  and  less  expen¬ 
sive  to  accomplish  earlier  than  later.  Equipment  level  reliability 
predictions  for  fault  tolerant  systems  must  take  into  account  re¬ 
dundancies  present  in  lower  tier  hardware  elements. 
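
The ranking of subsystems by their contribution to mission failure, described above, might be sketched as follows. The subsystem names and predicted failure probabilities are hypothetical, and the share calculation assumes that every listed subsystem is required for mission success and that the individual failure probabilities are small.

```python
def mission_failure_probability(subsystem_failure_prob):
    """P(mission failure) when every listed subsystem must survive."""
    survive = 1.0
    for q in subsystem_failure_prob.values():
        survive *= (1.0 - q)
    return 1.0 - survive


def failure_contributions(subsystem_failure_prob):
    """Rank subsystems by their approximate share of mission failure probability.

    The share q_i / sum(q) is a small-probability approximation; it is
    intended only to show where management attention pays off most.
    """
    total = sum(subsystem_failure_prob.values())
    shares = [(name, q / total) for name, q in subsystem_failure_prob.items()]
    return sorted(shares, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    predicted = {"radar": 0.02, "comm": 0.005, "nav": 0.001}  # hypothetical values
    print("mission failure probability:", mission_failure_probability(predicted))
    for name, share in failure_contributions(predicted):
        print(f"{name}: {share:.1%}")
```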

•  Task  204,  Failure  Mode,  Effects  and  Criticality  Analysis  (FMECA)
-  This  task  is  "generally  applicable"  during  all  phases  of  fault
tolerant  system  development.  In  particular,  imposition  of  this
task  is  recommended  at  the  system  level  during  the  Concept
Exploration  phase  and  at  lower  levels,  as  applicable,  during  the
Demonstration/Validation  phase.  The  FMECA  results  should  be
used  to  confirm  the  validity  of  the  reliability  model  (task  201),
to  verify  compliance  with  qualitative  fault  tolerance  criteria  (e.g.,
fail-operational/fail-safe  requirements),  and  to  compute  reliability
estimates  of  subsystems  or  functional  equipment  groupings,
particularly  where  redundancy  or  fault  protection  is  present.
The  SOW  and  CDRL  must  also  identify  the  equipment  level
at  which  the  FMECA  is  conducted,  taking  into  consideration  any
specification  requirements  relating  to  the  system  level  at  which
faults  will  or  will  not  be  tolerated.  During  the  Full-Scale  Engi¬ 
neering  Development  phase,  the  FMECA  must  also  be  conducted 
on  highly  mission  critical  systems  with  emphasis  on  relevant 
fault  classes  such  as  transient,  intermittent,  permanent,  latent, 
common  cause  and  catastrophic  failures.  The  procedures  for  im¬ 
plementing  FMECAs  on  fault  tolerant  systems  are  quite  similar  to 
those  used  on  non-fault  tolerant  systems.  However,  where  multiple
layers  of  redundancy  or  reconfiguration  capability  in  response
to  failures  are  provided,  the  FMECA  activity  must  include
a  review  of  testability  features  to  assure  that  adequate  fault  de¬ 
tection/fault  isolation  capability  exists  to  preclude  fault  propa¬ 
gation  and  support  system  reconfiguration. 

•  Task  208,  Reliability  Critical  Items  -  For  fault  tolerant  system
development,  it  is  recommended  that  this  task  be  initiated  during
the  Demonstration/Validation  phase  to  the  extent  that  analysis
(e.g.,  FMECA)  of  system  configurations  has  identified  items
whose  failure  can  significantly  affect  system  safety,  mission  suc¬ 
cess,  availability  or  total  maintenance/logistics  support  cost. 
Reliability  critical  items,  once  identified  as  a  part  of  the  selected 
configurations,  should  be  retained  and  closed-out  in  subsequent
program  phases.  Reliability  critical  items  which  cannot  be  elimi¬ 
nated  by  design  are  the  prime  candidates  for  additional  analysis, 
growth  testing,  reliability  qualification  testing,  reliability  stress 
analyses,  and  other  techniques  to  reduce  the  system's  reliability,
availability  or  LCC  risk.  It  is  advisable  to  request  the  prime 
contractor  to  examine  the  list  of  reliability  critical  items  and 
make  appropriate  recommendations  for  additions  and  deletions 
with  supporting  rationale. 

•  Task  303,  Reliability  Qualification  Test  (RQT)  Program  -
Reliability  qualification  testing  provides  a  reasonable  assurance
that  a  subsystem's/system's  minimum  acceptable  reliability  requirements
have  been  met  before  committing  to  production.  Normally,
RQTs  of  non-redundant  items  utilize  test  plans  to  statistically 
verify  the  item's  specified  minimum  acceptable  mean-time-between- 
failure  (MTBF).  A  mission-time-between-critical-failure  (MTBCF) 
requirement  contained  in  the  System  Specification  of  a  fault 
tolerant  system  should  be  verified  by  analysis  or  test.  It  is 
recommended  that  AF  program  managers  consider  selectively 
supplementing  MTBF  RQTs  for  fault  tolerant  system  equipment
by  requiring  verification  of  MTBCF  requirements  by  demonstration 
test.  This  recommendation  applies  to  highly  mission/safety  crit¬ 
ical  subsystems/systems  which  contain  redundant  equipments  with 
low  MTBFs.  It  also  applies  when  the  complexity  of  the  system's 
fault  tolerant  protection  mechanism  may  limit  confidence  in  ana¬ 
lytical  approaches  to  MTBCF  verification.  However,  the  pres¬ 
ence  of  high  MTBCF  values  or  low  volume  production  may  make 
it  impossible  to  demonstrate  the  MTBCF  with  statistical  confidence. 
In  these  cases,  the  program  manager  should  require  that  the 
MTBCF  be  verified  by  rigorous  analysis  that  includes,  as  ap¬ 
propriate,  the  use  of  a  proven  reliability  model  (see  para.  5.1.6) 
and/or  computer  simulation  techniques.  MIL-STD-781,  Reliability
Design  Qualification  and  Production  Acceptance  Tests:  Exponential
Distribution,  cannot  be  used  to  accurately  assess  the  decision
risks  related  to  the  reliability  demonstration  of  fault
tolerant/redundant  systems,  since  the  distribution  of  times-to-failure
of  such  systems  does  not  follow  an  exponential  function.  However,
a  Monte  Carlo  simulation  program  is  capable  of  solving  this  prob¬ 
lem  and: 

-  Evaluating  and  defining  the  producer  and  consumer  risks  for 
various  system  MTBCF  values 

-  Offering  optional  selection  of  sequential  or  fixed  length  test 
plans 

-  Allowing  evaluation  of  systems  which  operate  under  either  de¬ 
ferred  or  periodic  maintenance  policies. 

A  Monte  Carlo  simulation  program  for  MTBCF  demonstration  tests 
has  been  developed  and  is  described  in  Reliability  Demonstration 
Technique  for  Fault  Tolerant  Systems  (Reference  1). 
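
The following Python sketch is not the simulation program of Reference 1; it only illustrates, under stated assumptions, how a Monte Carlo simulation can estimate the decision risks of a fixed length MTBCF demonstration test. The assumed system is two identical units with exponentially distributed failure times, perfect coverage, and restoration of both units after each critical failure (a deferred maintenance policy). The test length, allowable number of failures, unit MTBF values and function names are all hypothetical.

```python
import random


def system_mtbcf(unit_mtbf):
    """Mean time between critical failures of the assumed two-unit system.

    The time to critical failure is the larger of two independent
    exponential failure times, whose mean is 1.5 times the unit MTBF.
    """
    return 1.5 * unit_mtbf


def time_to_critical_failure(rng, unit_mtbf):
    """Draw one time to critical failure (both units failed)."""
    rate = 1.0 / unit_mtbf
    return max(rng.expovariate(rate), rng.expovariate(rate))


def acceptance_probability(unit_mtbf, test_hours, max_failures, trials=20000, seed=1):
    """Estimate P(accept) for a fixed length test: accept if the number of
    critical failures observed in test_hours does not exceed max_failures."""
    rng = random.Random(seed)
    accepted = 0
    for _ in range(trials):
        elapsed = 0.0
        failures = 0
        while failures <= max_failures:
            elapsed += time_to_critical_failure(rng, unit_mtbf)
            if elapsed > test_hours:
                break
            failures += 1
        if failures <= max_failures:
            accepted += 1
    return accepted / trials


if __name__ == "__main__":
    # Hypothetical test plan: 3000 test hours, accept on 2 or fewer critical failures.
    good = acceptance_probability(1000.0, 3000.0, 2)  # system MTBCF of 1500 hours
    poor = acceptance_probability(400.0, 3000.0, 2)   # system MTBCF of 600 hours
    print("producer's risk (rejecting the good system):", 1.0 - good)
    print("consumer's risk (accepting the poor system):", poor)
```

Repeating the estimate over a range of system MTBCF values gives an operating characteristic curve for the assumed test plan, from which the producer and consumer risks can be read.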

Additional  reliability  tasks  for  the  development  of  fault  tolerant  sys¬ 
tems  include  performing  those  trade  studies  and  analyses  required  to  de¬ 
fine  a  reliable  and  supportable  system  architecture  that  is  also  cost  ef¬ 
fective.  It  is  important  that  the  selected  reliability  tasks  be  coordinated 
with  associated  Maintainability,  Testability,  Logistics  Support  and  System 
Safety  tasks  and  analyses. 

2.1.2  Maintainability  Program  Tailoring 

As  described  in  MIL-STD-470A,  Appendix  A,  Section  30,  cost- 
effective  task  selection  and  tailoring  can  materially  aid  in  attaining 
program  maintainability  requirements.  Some  maintainability  tasks,  applic¬ 
able  to  fault  tolerant  system  development,  require  additional  emphasis 
with  regard  to  advancing  their  implementation  schedule  and  requiring  a 
higher  level  of  effort.  The  MIL-STD-470  task  application  matrix,  modi¬ 
fied  for  fault  tolerant  system  developments,  is  shown  in  Table  2-2. 



TABLE 2-2. MIL-STD-470 Maintainability Task Application Guidance Matrix for Fault Tolerant Systems.

     |                                               | TASK |                      PROGRAM PHASE
TASK | TITLE                                         | TYPE | CONCEPT | VALID      | FSD     | PROD    | OPER SYSTEM
     |                                               |      |         |            |         |         | DEV (MODS)
-----+-----------------------------------------------+------+---------+------------+---------+---------+------------
101  | MAINTAINABILITY PROGRAM PLAN                  | MGT  |         | G(3)       | G       | G(3)(1) | ■
102  | MONITOR/CONTROL OF SUBCONTRACTORS AND VENDORS | MGT  | N/A     | S          | G       | G       | S
103  | PROGRAM REVIEWS                               | MGT  | S       | G(3)       | G       | G       | S
104  | DATA COLLECTION, ANALYSIS AND CORRECTIVE      | ENG  | N/A     | S          | G       | G       | ■
     | ACTION SYSTEM                                 |      |         |            |         |         |
201  | MAINTAINABILITY MODELING                      | ENG  | S       | S(4)       | G       | C       | ■
202  | MAINTAINABILITY ALLOCATIONS                   | ACC  |         |            |         | C       | ■
203  | MAINTAINABILITY PREDICTIONS                   | ACC  |         |            | G(2)    | C       | ■
204  | FAILURE MODES AND EFFECTS ANALYSIS (FMEA)     | ENG  | N/A     | S(2)(3)(4) | G(1)(2) | C(1)(2) | ■
     | MAINTAINABILITY INFORMATION                   |      |         |            |         |         |
205  | MAINTAINABILITY ANALYSIS                      | ENG  | S(3)    | G(3)       | ■       | ■       | ■
206  | MAINTAINABILITY DESIGN CRITERIA               | ENG  |         | S(3)       | G       | C       | ■
207  | PREPARATION OF INPUTS TO DETAILED MAINTENANCE | ACC  | N/A     | S(2)(3)    | G(2)    | C(2)    | ■
     | PLAN AND LOGISTICS SUPPORT ANALYSIS (LSA)     |      |         |            |         |         |
301  | MAINTAINABILITY DEMONSTRATION (MD)            | ACC  | N/A     | S(2)       | G(2)    | G(2)    | S(2)

NOTE: PROGRAM PHASE APPLICABILITY CHANGES FROM TABLE A-1 OF
MIL-STD-470 ARE SHOWN WITH ■ (SHADED).

CODE DEFINITIONS

TASK TYPE
ACC - MAINTAINABILITY ACCOUNTING
ENG - MAINTAINABILITY ENGINEERING
MGT - MANAGEMENT

PROGRAM PHASE
S   - SELECTIVELY APPLICABLE
G   - GENERALLY APPLICABLE
C   - GENERALLY APPLICABLE TO DESIGN CHANGES ONLY
N/A - NOT APPLICABLE
(1) - REQUIRES CONSIDERABLE INTERPRETATION OF INTENT TO BE COST EFFECTIVE.
(2) - MIL-STD-470 IS NOT THE PRIMARY IMPLEMENTATION DOCUMENT. OTHER MIL-STDS
      OR STATEMENT OF WORK REQUIREMENTS MUST BE INCLUDED TO DEFINE OR RESCIND
      THE REQUIREMENTS. FOR EXAMPLE, MIL-STD-471 MUST BE IMPOSED TO DESCRIBE
      MAINTAINABILITY DEMONSTRATION DETAILS AND METHODS.
(3) - APPROPRIATE FOR THOSE TASK ELEMENTS SUITABLE TO DEFINITION DURING PHASE.
(4) - DEPENDS ON PHYSICAL COMPLEXITY OF THE SYSTEM UNIT BEING PROCURED, ITS
      PACKAGING AND ITS OVERALL MAINTENANCE POLICY.



Systems  managers  incorporating  a  high  degree  of  fault  tolerance  in  their
designs  should  use  this  matrix  with  emphasis  on  early  program  phases,
particularly  the  Concept  Exploration  and  the  Demonstration/Validation
phases.  Effective  and  feasible  concepts  for  Maintainability,  Diagnostics
and  Maintenance  must  be  developed  and  applied  as  early  as  possible  to
ensure  that  major  alternatives  can  be  examined  for  overall  impact  on
performance  and  LCC  before  the  system  design  is  "cast  in  concrete."
Earlier  corrective  actions  can  thereby  be  initiated  and  maintainability
inputs  can  be  provided  for  evaluating  the  impact  on  mission  reliability
and  readiness,  as  well  as  LCC.  Crucial  issues  such  as  mission-related  and
large  cost-impact  items  are,  therefore,  addressed  earlier  in  a  more
comfortable  and  realistic  time  frame.  For  example,  the  proposed  addition  of
redundant  subsystems/equipment/modules  may  appear  to  improve  overall
mission  reliability  with  a  minor  penalty  of  additional  technical,  operational
or  testability  complexity.  Without  the  timely  maintainability  evaluation  of
the  critical  design  changes  for  this  added  redundancy,  the  diagnostics
and  accessibility  of  an  otherwise  easy  access  point  could  be  compromised
and  result  in  a  severe  impact  to  the  item's  mean-time-to-repair  capability.
Maintainability  tasks  that  require  additional  emphasis  and  tailoring
in  both  the  SOW  and  associated  CDRLs  for  fault  tolerant  system  developments
are  discussed  in  the  paragraphs  that  follow.  In  addition,  a  number
of  other  maintainability  tasks,  which  are  implemented  in  the  same
way  for  both  fault  tolerant  and  non-fault  tolerant  systems,  have  been
included  due  to  their  importance  in  the  formulation  of  the  overall
maintainability  program.

•  Task  101,  Maintainability  Program  Plan  -  The  development  of  a
maintainability  program  plan  for  fault  tolerant  systems  should  be
considered  as  "generally  applicable"  for  all  program  phases  and
all  system  modifications.  The  plan  is  needed  early  in  the  development
cycle  to  define  the  early  concepts  necessary  to  guide  the
maintainability  program  during  subsequent  program  phases.  The
primary  objectives  of  a  maintainability  program  are  to  ensure
design  adherence  to  specified  maintainability  parameters  in  an
environment  of  maintenance,  support,  and  lower  LCC  constraints.
Diagnostics,  for  instance,  provide  multiple  capabilities  for  redundancy
management,  fault  tolerance,  on-line  performance  monitoring,
and  basic  maintenance  fault  localization  functions.  Identification
and  analysis  of  risks  play  a  key  role  due  to  the  high
level  of  uncertainty  that  is  present  early  in  a  system's  life  cycle.
The  maintainability  program  plan  should  identify  all  maintainability
analyses  to  be  performed.  These  analyses  are  necessary
to  establish  and  identify  the  risks  involved  in  levels  of  repair,
false  alarm  rates,  proportion  of  faults  detectable,  levels  of
isolation,  and  development  of  external  test  systems.  The  maintainability
program  plan  should  provide  management  a  description
of  how  the  contractor  intends  to  satisfy  mission  maintainability
requirements.  This  task  is  implemented  in  the  same  way  for
both  fault  tolerant  and  non-fault  tolerant  systems.  It  should  be
noted  that  the  maintainability  program  plan  may  be  submitted  as
an  integrated  plan  including  reliability  and  testability.

•  Task  104,  Data  Collection,  Analysis,  and  Corrective  Action  System
-  This  task  is  established  to  aid  design,  identify  corrective
action  tasks  and  evaluate  test  results.  The  data  collection  sys¬ 
tem  should  be  defined  as  early  as  possible,  but  not  later  than 
the  Demonstration/Validation  phase  and  should  be  considered  as 
"generally  applicable"  for  all  system  modifications.  The  data  col¬ 
lection  system  used  during  the  maintainability  demonstration 
should  receive  preliminary  planning  during  the  Demonstration/ 
Validation  phase  and  should  become  firm  in  the  maintainability 
demonstration  plan  prior  to  testing.  The  data  collection  system 
should  be  used  as  a  means  for  identifying  maintainability  design 
problems  and  errors,  and  for  initiating  corrective  actions. 
These  corrective  actions  can  take  the  form  of  modifications  and
changes  to  equipment  maintenance  procedures  and  fault  detection
and  isolation  features  (hardware  and  software)  to  improve  the
fraction  of  faults  detectable  and  the  fraction  of  faults  isolatable,
and  to  reduce  false  alarm  rates,  maintenance  induced  faults,
system  outages  and  excessive  corrective  or  preventive  maintenance  times.
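
The diagnostic measures named above can be tracked from the collected maintenance data along the following lines. The metric definitions used here are common working definitions adopted as assumptions for this sketch; the governing definitions are those imposed by the contract and the applicable MIL-STDs. The counts in the example are hypothetical.

```python
def diagnostic_metrics(confirmed_faults, detected, isolated, false_alarms):
    """Compute simple diagnostic measures from maintenance data counts.

    confirmed_faults: faults confirmed by maintenance action
    detected:         confirmed faults that the system's diagnostics detected
    isolated:         detected faults isolated to the required level
    false_alarms:     fault indications for which no fault was confirmed
    """
    if confirmed_faults <= 0 or detected <= 0:
        raise ValueError("need at least one confirmed and one detected fault")
    indications = detected + false_alarms
    return {
        "fraction_of_faults_detected": detected / confirmed_faults,
        "fraction_of_faults_isolated": isolated / detected,
        "false_alarm_fraction": false_alarms / indications,
    }


if __name__ == "__main__":
    # Hypothetical counts from a reporting period.
    for name, value in diagnostic_metrics(200, 190, 171, 10).items():
        print(f"{name}: {value:.3f}")
```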

•  Task  201,  Maintainability  Modeling  -  During  the  Concept  Definition
and  Demonstration/Validation  phases,  various  fault  tolerant
system  design  and  support  alternatives  may  be  evaluated  through
the  use  of  models.  During  the  Full-Scale  Development  phase,  the
previously  developed  models  should  be  updated  and  used  to
measure  the  progress  achieved  versus  the  specified  requirements
and  goals.  These  models  should  also  be  used  to  evaluate  the
maintainability  impact  of  design  changes.  The  models  may  also
be  utilized  to  determine  the  impacts  of  changes  in  fault  detection
probability,  fraction  of  isolatable  failures,  and  frequency  of
failures.  For  fault  tolerant  systems  designed  for  an  on-line
maintenance  concept,  the  maintainability  modeling  task  must  con¬ 
sider  the  effect  of  on-line  maintenance  on  system  performance 
and  the  ability  of  the  system  to  meet  overall  R&M  requirements. 
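
A very small maintainability model of the kind described above might relate mean corrective maintenance time to fault detection probability and to the fraction of isolatable failures. The model form, the parameter names and the times used below are assumptions made for illustration only: a fault that is both detected and isolated automatically is assumed to need only a short isolation time, while any other fault requires manual troubleshooting.

```python
def mean_corrective_time(p_detect, frac_isolatable,
                         t_auto_isolate, t_manual_isolate, t_repair):
    """Mean corrective maintenance time (hours) for a simple two-path model.

    p_detect:         probability that a fault is detected automatically
    frac_isolatable:  fraction of detected faults isolated automatically
    t_auto_isolate:   isolation time when diagnostics isolate the fault
    t_manual_isolate: isolation time when manual troubleshooting is needed
    t_repair:         remove, replace and checkout time once isolated
    """
    p_auto = p_detect * frac_isolatable
    isolation = p_auto * t_auto_isolate + (1.0 - p_auto) * t_manual_isolate
    return isolation + t_repair


if __name__ == "__main__":
    # Hypothetical comparison of two diagnostic design alternatives.
    baseline = mean_corrective_time(0.90, 0.80, 0.1, 2.0, 0.5)
    improved = mean_corrective_time(0.95, 0.90, 0.1, 2.0, 0.5)
    print("baseline:", baseline, "hours; improved:", improved, "hours")
```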

•  Task  202,  Maintainability  Allocations  -  The  maintainability  allo¬ 
cation  process  is  the  same  for  both  fault  tolerant  and  non-fault 
tolerant  systems.  However,  since  stringent  availability  require¬ 
ments  are  usually  imposed  on  fault  tolerant  systems,  it  is  impor¬ 
tant  that  overall  system  maintainability  objectives  be  translated 
into  maintainability  requirements  for  system  components.  Main¬ 
tainability  is  a  key  factor  affecting  availability  (see  para. 
5.2.1);  accordingly,  maintainability  allocations  should  be  con¬ 
ducted  on  system  elements  suitable  to  definition  during  the  early 
program  phases  when  the  most  flexibility  in  tradeoffs  and  redefinition
exists.  Starting  early  also  allows  time  to  establish  lower
level  maintainability  and  system  level  diagnostic  requirements 
that  can  be  allocated  to  subsystems,  and  diagnostic  requirements 
that  can  be  allocated  to  assemblies.  Also,  the  maintainability 
requirements  must  be  frozen  at  some  point  to  provide  a  baseline 
for  the  designer.  Fault  detection  and  fault  isolation  probabilities 
to  a  given  level  must  be  defined. 
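
The link between an availability requirement and a maintainability requirement can be sketched with the standard inherent availability relationship, A = MTBF / (MTBF + MTTR). Whether this is the relationship developed in para. 5.2.1 is not shown in this section, so its use here is an assumption, as are the numerical values and function names.

```python
def inherent_availability(mtbf_hours, mttr_hours):
    """Inherent availability from MTBF and MTTR."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def max_mttr_for_availability(mtbf_hours, required_availability):
    """Largest MTTR that still meets the availability requirement.

    Obtained by solving A = MTBF / (MTBF + MTTR) for MTTR.
    """
    if not 0.0 < required_availability < 1.0:
        raise ValueError("required availability must be between 0 and 1")
    return mtbf_hours * (1.0 - required_availability) / required_availability


if __name__ == "__main__":
    # Hypothetical element: 500 hour MTBF, 0.99 availability requirement.
    mttr_limit = max_mttr_for_availability(500.0, 0.99)
    print("MTTR must not exceed", round(mttr_limit, 2), "hours")
    print("check:", inherent_availability(500.0, mttr_limit))
```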

•  Task  203,  Maintainability  Predictions  -  The  maintainability  prediction
process  is  the  same  for  fault  tolerant  and  non-fault  tolerant
systems.  This  task  should  be  selectively  applied  during  the
Concept  Exploration  phase  to  evaluate  and  trade  off  various  fault
tolerant  system  design  configurations.  However,  during  the 
Demonstration/Validation  and  Full-Scale  Development  phases, 
maintainability  predictions  should  be  used  to  determine  the 
degree  of  compliance  with  specification  requirements.  Up-to-date
predictions  provide  engineers  and  management  with  essential 
information  on  maintainability  program  progress;  in  addition, 
they  are  important  elements  in  the  program  decision  making 
process.  Since  a  limited  quantity  of  specific  design  data  may  be 
available  during  the  Demonstration/Validation  phase,  maintainability
predictions  must  be  based  largely  on  experience  with
predecessor  (similar)  systems  and  on  reliable/proven  prediction 
techniques.  During  the  Full-Scale  Development  phase,  maintain¬ 
ability  predictions  can  be  used  to  determine  the  inherent  main¬ 
tainability  characteristics  of  the  proposed  system,  the  effects  of 
proposed  changes  on  maintainability,  and  the  optimum  tradeoff  of 
equipment  characteristics.  Predictions  made  during  this  phase 
are  generally  more  accurate  than  those  made  in  earlier  phases, 
since  more  specific  system  information  is  available. 
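
One widely used prediction calculation, shown here only as an illustrative assumption since this section does not name a specific technique, weights each item's mean repair time by its failure rate to obtain a system level mean-time-to-repair. The item names, failure rates and repair times are hypothetical.

```python
def predicted_mttr(items):
    """Failure-rate-weighted mean time to repair.

    items: iterable of (failure_rate, mean_repair_time_hours) pairs.
    MTTR = sum(rate_i * time_i) / sum(rate_i)
    """
    items = list(items)
    total_rate = sum(rate for rate, _ in items)
    if total_rate <= 0.0:
        raise ValueError("total failure rate must be positive")
    return sum(rate * hours for rate, hours in items) / total_rate


if __name__ == "__main__":
    # Hypothetical replaceable units: (failures per hour, repair hours).
    units = {
        "processor module": (100e-6, 0.5),
        "power supply": (300e-6, 1.5),
    }
    print("predicted MTTR:", predicted_mttr(units.values()), "hours")
```

Because frequently failing items dominate the result, a design change that shortens repair of a high failure rate item improves the prediction more than the same change to a rarely failing item.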




•  Task  204,  Failure  Mode  and  Effects  Analysis  (FMEA)  -  A  FMEA
is  used  to  identify  critical  failure  modes  and  to  check  the  diagnostic
capability  for  detecting  and  isolating  each  of  these  modes.
Specifically,  this  capability  relates  to  such  activities  as  the  de¬ 
termination  and  design  of  indices  of  failure,  placement  and  na¬ 
ture  of  test  points,  development  of  troubleshooting  schemes,  and 
the  establishment  of  design  characteristics  and  criteria  for  fault 
detection  and  isolation  at  all  equipment  levels.  The  effectiveness 
of  this  fault  detection  and  isolation  capability  becomes  a  critical 
driver  for  maintainability  design  at  organizational,  intermediate, 
and  depot  maintenance  levels.  Potential  design  weaknesses  which 
seriously  impact  safety,  reliability,  or  maintainability  are  iden¬ 
tified  through  the  proper  use  of  the  FMEA,  BIT/self-test  (ST), 
and  preventive/corrective  maintenance  analyses.  Top-level  FMEA 
activity  should  be  initiated  during  the  Concept  Exploration  phase 
where  only  more  obvious  failure  modes  may  be  identified  since 
design  definition  is  limited.  As  greater  design  and  mission 
definition  becomes  available  during  Demonstration/Validation  and 
Full-Scale  Development  phases,  the  analysis  should  be  expanded 
to  successively  more  detailed  levels  (i.e.,  system,  subsystem,  and
equipment  levels)  and  ultimately  to  the  piece  part  level  if 
warranted  based  on  mission  criticality. 

•  Task  205,  Maintainability  Analysis  -  The  tradeoff  process  required
to  pick  the  fault  tolerant  design  best  suited  to  meet  system
R/M/T  requirements  and  program  LCC  constraints  requires  a
maintainability  analysis.  In  general,  this  task  has  four  main
purposes:  (1)  to  establish  design  criteria  that  will  provide  the
desired  system  features;  (2)  to  allow  for  design  decisions  to  be
made  through  the  evaluation  of  alternatives  and  through  the  use
of  tradeoff  studies;  (3)  to  contribute  toward  the  development  of
maintenance,  repair,  and  servicing  policies  best  suited  to  the
system;  and  (4)  to  verify  that  the  design  complies  with
maintainability  design  requirements.

•  Task  206,  Maintainability  Design  Criteria  -  For  the  stringent 
maintainability  requirements  typically  imposed  on  fault  tolerant 
systems  to  be  tailored  into  practical  and  effective  hardware  de¬ 
signs,  it  is  recommended  that  a  broad  spectrum  of  maintainability 
design  criteria  be  defined  and  employed.  Although  the  task 
procedures  are  the  same  as  those  used  for  non-fault  tolerant 
systems,  this  task  should  be  considered  as  "selectively 
applicable"  during  the  Concept  Exploration  phase  and  "generally 
applicable"  for  all  design  changes. 

•  Task  301,  Maintainability  Demonstration  -  Planning  for  this  task
should  start  no  later  than  the  beginning  of  the  Full-Scale  Devel¬ 
opment  phase.  Maintainability  demonstration  is  the  process  in 
which  a  test  is  conducted  to  show  whether  or  not  an  item  pos¬ 
sesses  satisfactory  maintainability  characteristics.  The  specific 
approach  used  can  range  from  limited  controlled  tests  to  an  ex¬ 
tensive  controlled  field  test  of  the  product.  The  test  methods 
and  requirements  for  the  formal  maintainability  demonstration 
should  be  established  in  accordance  with  MIL-STD-471  and  in¬ 
troduced  in  the  Request  for  Proposal  (RFP).  The  SOW  should 
specify  details  concerning  the  required  nature,  conduct  and  sub¬ 
stance  of  the  test(s)  to  be  performed. 

The  Contracting  Activity  should  determine  the  need,  type  and 
scope  of  the  formal  maintainability  demonstration  test.  The  deci¬ 
sion  should  be  based  on  mission  requirements,  costs  of  tests, 
and  type  of  equipment  being  developed.  A  maintainability  demonstration
does  not  guarantee  achievement  of  the  maintainability
requirements.  However,  it  will  focus  attention  on  the
item's  marginal  performance,  particularly  when  the  demonstration
is  structured  properly  to  evaluate  the  maintainability  design  features.
In  particular,  if  fault  tolerant  system  requirements  dictate
that  system  operation  continue  while  a  redundant  system  is
being  maintained,  the  planned  maintainability  demonstration  should 
test  this  capability.  The  Contracting  Activity  should  also  supply 
information  that  is  based  on  operational  and  deployment  con¬ 
straints.  This  provides  the  basis  for  defining  realistic  test 
procedures.  As  a  minimum,  this  information  should  include  the
maintenance  philosophy,  descriptions  of  the  maintenance  environ¬ 
ments,  the  modes  of  operation  for  the  test,  and  the  levels  of 
maintenance  to  be  demonstrated. 

For  fault  tolerant  systems,  the  above  stated  maintainability  tasks 
should  be  coordinated  with  associated  Reliability,  Testability,  Human 
Factors,  Logistic  Support,  and  System  Safety  tasks  and  analyses. 

2.1.3  Testability/Diagnostic  Program  Tailoring

Before  selecting,  tailoring,  and  integrating  testability/diagnostic 
tasks  into  a  C3I  system  program,  the  AF  program  manager  should  review
the  testability  program  application  guidance  included  in  Appendix  A  of 
MIL-STD-2165  and  in  particular,  the  System  Flow  Diagram  illustrated  in 
Figure  1  therein.  For  convenience,  the  System  Testability  Program  Flow 
Diagram  is  reproduced  as  Fig.  2-1  herein.  The  MIL-STD-2165  task  ap¬ 
plication  matrix,  modified  for  fault  tolerant  system  developments,  is 
shown  in  Table  2-3.  Both  Fig.  2-1  and  Table  2-3  are  relevant  to  most 
applications,  including  systems  that  incorporate  a  high  degree  of  fault 
tolerance. 

The  testability  design  process  must  take  into  account  both  spatial 
and  temporal  considerations  for  fault  detection.  In  particular,  the 
failure  detection  approach  selection  must  be  based  upon  the  requirement 
for  maximum  acceptable  failure  latency.  Continuous  failure  detection 
techniques  should  be  used  to  monitor  those  functions  which  are  mission 
critical  and/or  affect  safety  and  where  protection  must  be  provided 


[Figure  2-1.  System  Testability  Program  Flow  Diagram  (Sheets  1  and  2).
NOTE:  Tasks  within  dotted  box  are  not  part  of  MIL-STD-2165.]


TABLE  2-3.  MIL-STD-2165  Tiitabtllty  T«»k  Application*  Guidance  Matrix 
for  Fault  Toltrtnt  Syitirm. 


TITLE 

PROGRAM  PHASE 

CONCEPT 

FSD 

PROD 

■H 

101 

TESTABILITY  PROGRAM 
PLANNING 

■ 

■ 

H 

NA 

102 

TESTABILITY  REVIEWS 

G(1) 

s 

103 

TESTABILITY  DATA  COLLEC¬ 
TION  AND  ANALYSIS 

PLANNING 

NA 

S 

i 

G 

201 

TESTABILITY  REQUIREMENTS 

G(1) 

G 

NA 

202 

TESTABILITY  PRELIMINARY 
DESIGN  AND  ANALYSIS 

NA 

S 

H 

s 

1 

TESTABILITY  DETAIL  DESIGN 
AND  ANALYSIS 

NA 

s 

G 

S 

301 

TESTABILITY 

DEMONSTRATION 

NA 

s 

G 

S 

NOTE:  PROGRAM  PHASE  APPLICABILITY  CHANGE  FROM  TABLE  I.  APPENDIX  A 
OF  MIL-STD-2165  IS  SHOWN  WITHl- 


CODE  DEFINITIONS 


CONCEPT  -  CONCEPT  EXPLORATION  S  -  SELECTIVELY  APPLICABLE  TO 

HIGH  RISK  ITEMS  DURING  D&V, 

D&V  -  DEMONSTRATION  &  VALIDATION  OR  TO  DESIGN  CHANGES  DURING 

PROD. 

FSD  -  FULL  SCALE  DEVELOPMENT 

G-  GENERALLY  APPLICABLE 

PROD  -  PRODUCTION  &  DEPLOYMENT 

NA-  NOT  APPLICABLE 


(1)  MIL-STD-1388  IS  PRIMARY  IMPLEMENTATION  DOCUMENT  F.'R  DIAGNOSTIC  REQUIREMENTS 
TRADEOFFS  AND  REVIEW  AS  PART  OF  LOGISTICS  SUPPORT  ANALYSIS  DURING  CONCEPT 
EXPLORATION  PHASE. 

R«7-3337-004(T) 

against  the  propagation  of  errors  through  the  system.  Periodic  testing 
may  be  used  for  monitoring  those  functions  which  provide  backup/  standby 
capabilities  or  are  not  mission  critical.  On-demand  testing  is  typically 
used  for  monitoring  those  functions  which  require  operator  interaction, 
sensor  stimulation,  etc.,  or  which  are  not  easy,  safe,  or  cost-effective 
to  initiate  automatically.  The  maximum  permitted  latency  for  failure 
detection  determines  the  frequency  at  which  diagnostic  procedures  should 
be  run  and  should  take  into  account  function  criticality,  failure  rate, 
possible  wear  out  factors,  and  the  selected  maintenance  concept. 
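
The relationship between maximum permitted latency and test frequency can be illustrated with a minimal sketch. It assumes a simple worst case that is not stated in this guide: a failure occurs just after one periodic test pass has examined the function, so it is reported only when the next pass completes. The function name, the margin parameter and the numbers are hypothetical.

```python
def max_test_interval(max_latency_s: float, test_duration_s: float,
                      margin_s: float = 0.0) -> float:
    """Return the longest start-to-start interval for a periodic test.

    Assumed worst case: the failure is reported only when the next test
    pass completes, i.e. after (interval + test duration).
    """
    interval = max_latency_s - test_duration_s - margin_s
    if interval <= 0:
        # No periodic schedule can meet the requirement under this model.
        raise ValueError(
            "latency requirement cannot be met by periodic testing; "
            "continuous monitoring is needed")
    return interval


if __name__ == "__main__":
    # Hypothetical backup function: 60 s allowed latency, 5 s test,
    # 10 s reserved as margin.
    print(max_test_interval(60.0, 5.0, margin_s=10.0))  # 45.0
```

When the function raises, the latency requirement is tighter than a periodic test can satisfy under this model, which corresponds to the case where continuous failure detection is called for.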




Current  C3I  systems  are  capable  of  achieving  high  levels  of  fault 
detection  coverage  by  utilizing  BIT  and  manual  aided  operational  tests. 
However,  the  premise  of  using  the  same  test  for  similar  elements  may  lull 
managers  into  thinking  that  redundant  elements  will  not  add  to  the  com¬ 
plexity  of  the  equipment's  BIT,  ATE,  or  manual  fault  detection/isolation 
techniques.  Unless  potential  problem  areas  are  cited  well  in  advance  and 
excellent  fault  tolerant  techniques  are  included,  it  might  become  impossi¬ 
ble  to  meet  stringent  fault  detection  coverage  demands.  Examples  of  such 
problem  areas  are  given  in  para.  4.2.1  of  this  document.  Testability 
tasks  that  require  additional  emphasis  and  tailoring  in  both  the  SOW  and 
associated  CDRLs  for  fault  tolerant  system  developments  are  discussed  in 
the  paragraphs  that  follow.  In  addition,  a  number  of  other  testability 
tasks  which  are  implemented  in  the  same  way  for  both  fault  tolerant  and 
non-fault  tolerant  systems  have  been  included  due  to  their  importance  in 
the  formulation  of  the  overall  testability  program. 

•  Task 101, Testability Program Planning - Although the procedures
for developing and implementing a testability program are the same
for both non-fault tolerant and fault tolerant systems, the success
of the latter is heavily dependent upon inherent testability features
(see para. 4.2.1). Therefore, the imposition of this task is
recommended during the Concept Exploration phase since it provides
management visibility for monitoring, control and coordination of
testability design considerations between interrelated design and
support activities. Submitted at the beginning of the
Demonstration/Validation phase, the testability program plan should
highlight the methodology to be used in establishing qualitative and
quantitative testability requirements for the system specification.
The plan should also describe the methodology to be used in
allocating quantitative system testability requirements down to the
subsystem or configuration item level. In order to establish and
maintain an effective testability program, the maintainability
manager must form a close liaison with all design disciplines, and
must be prepared to work aggressively with design engineers to ensure
a proper balance between performance, cost and supportability. It
should be noted that the testability program plan may be submitted
with the reliability and maintainability program plans.

•  Task  103,  Testability  Data  Collection  and  Analysis  Planning  -
Although  much  of  the  actual  collection,  subsequent  analysis  of 
data,  and  resulting  corrective  actions  may  occur  beyond  the  end 
of  the  program  phase  under  which  the  testability  design  effort  is 
performed,  it  is  essential  that  the  planning  for  this  task  be  ini¬ 
tiated  in  the  Full-Scale  Development  phase,  preferably  before  the 
critical  design  review  (CDR).  A  plan  should  be  developed  for 
the  analysis  of  production  test  results  and  maintenance  actions 
for  fielded  systems  to  determine  if  BIT  hardware  and  software, 
ATE  hardware  and  software,  and  maintenance  documentation  meet 
the  specifications  in  terms  of  fault  detection,  fault  resolution, 
false  indications,  fault  detection  times,  and  fault  isolation  times. 
Also,  all  data  collection  requirements  should  be  defined  to  meet 
the  needs  of  the  testability  analysis.  The  data  collected  should 
include  a  description  of  relevant  operational  anomalies  and  main¬ 
tenance  actions.  Data  collection  should  be  integrated  with 
similar  data  collection  procedures,  such  as  those  for  reliability 
and  maintainability,  and  Logistic  Support  Analysis  and  should  be 
compatible  with  specified  data  systems  in  use  by  the  military  us¬ 
er  organization.  This  task  is  implemented  in  the  same  way  for 
both  fault  tolerant  and  non-fault  tolerant  systems. 

•  Task 201, Testability Requirements - Accomplishment of this task
is recommended during the Concept Exploration and the
Demonstration/Validation phases. The purpose of this task is to
establish testability requirements and to identify the risks and
uncertainties involved in determining the performance monitoring,
BIT, level of fault tolerance, repair verification, fault
detection/isolation, test points, and off-line test objectives for
both fault tolerant and non-fault tolerant systems. Establishing
performance requirements at the system and subsystem level should
include specific numeric performance requirements imposed by the
procuring activity such as:

-  Maximum  allowable  time  between  the  occurrence  of  a  failure 
condition  and  the  detection  of  the  failure  for  each  mission 
function 

-  Maximum  allowable  occurrence  of  system  downtime  (usually 
specified  in  percent)  due  to  erroneous  failure  indications 
(false  alarms) 

-  Maximum  allowable  system  downtime  due  to  corrective  mainte¬ 
nance  actions  at  the  organizational  level. 

Testability  requirements  should  also  include  the  evaluation  and 
identification  of  alternative  diagnostic  concepts  which  include 
varying  degrees  of  BIT,  manual  and  off-line  automatic  testing 
and  diagnostic  test  points.  These  will  determine  the  sensitivity 
of  system  readiness  parameters  to  variations  in  key  testability 
parameters  which  include  BIT  fault  detection,  fault  isolation  and 
false  alarm  rates. 

•  Task 202, Testability Preliminary Design and Analysis - It is
recommended that this task be performed during the
Demonstration/Validation phase, modified during the Full-Scale
Development phase, and utilized to determine quantitative testability
requirements that are achievable, affordable, and adequately support
system operation and maintenance. The testability design techniques
in this task focus primarily on the compatibility between the item
and its off-line test equipment, the BIT (hardware and software)
provided in the item to detect and isolate faults, and the structure
of the item in terms of partitioning for enhanced fault isolation and
detection. Testability design techniques must be closely coordinated
with the fault tolerant designs, and should provide for the
independent testing of redundant circuitry. Fault assessment,
reconfiguration into degraded modes, and configuration verification
should make maximum use of equipment redundancy and functional
redundancy to assist in testing. The testability design techniques in
this task will be further refined and implemented in task 203.

•  Task  203,  Testability  Detail  Design  and  Analysis  -  Detailed
testability  design  is  an  important  aspect  of  the  design  process 
for  fault  tolerant  systems  since  inherent  testability  features  will 
ultimately  control  hardware  redundancies.  This  task  should  be 
accomplished  in  the  Demonstration/Validation  and  Full-Scale  De¬ 
velopment  phases  to  incorporate  testability  design  features  (in¬ 
cluding  BIT)  into  a  system  or  equipment  design  which  will  satis¬ 
fy  both  testability  and  overall  system  fault  tolerance  require¬ 
ments.  This  analysis  should  identify  the  failures  of  each  compo¬ 
nent  and  the  failures  between  components  which  correspond  to 
the  specified  failure  modes  of  each  equipment  to  be  tested. 
These  failures  represent  the  predicted  failure  population  and  are 
the  basis  for  test  derivation  (BIT  and  off-line  test)  and  test  ef¬ 
fectiveness  evaluation.  A  FMEA  from  task  204  of  MIL-STD-470 
should  be  fully  utilized  as  required.  The  FMEA  requirements 
may  need  to  be  modified  or  supplemented  to  provide  the  level  of 
detail  needed.  Analysis  should  be  performed  to  identify  the 
inherent  levels  of  BIT  fault  detection  and  isolation  in  the  design 
of  the  overall  system.  The  false  alarm  rate  for  the  overall
system  should  also  be  determined  by  analysis.  These  capabilities 
should  be  compared  to  the  requirements  to  see  if  they  are 
suitable  and  adequate  for  the  proposed  design.  System-level 
BIT  hardware  concepts  and  software  architectures  should  be  de¬ 
veloped  prior  to,  or  while  integrating,  the  BIT  capabilities  of 
each  subsystem/item.  The  procedures  for  conducting  this  task 
on  fault  tolerant  systems  are  the  same  as  those  used  for 
non-fault  tolerant  systems. 




•  Task 301, Testability Demonstration - The test methods and
requirements for the testability demonstration tests should be
established in accordance with MIL-STD-471A (Notice 2). Through the
development of the testability demonstration plan, the items to be
demonstrated under the maintainability demonstration and the Test
Program Set (TPS) demonstration may be coordinated (e.g., some common
faults inserted) so as to provide data on the correlation of BIT and
off-line test results. This can give an early indication of possible
"cannot duplicate" (CND) problems in the field. The false alarm rate
(an important testability parameter) is difficult to measure in the
controlled environment of a demonstration procedure. If the false
alarm rate were relatively high, it would be possible to make use of
a reliability demonstration procedure from MIL-STD-781 to demonstrate
the false alarm rate, treating each false alarm as a relevant
failure. In most cases, however, the rate will be low and almost
impossible to verify. Analytical techniques must then be employed.
The environmental conditions during the demonstration should (if
possible) be indicative of the expected operational environment in
order to expose the equipment to realistic stresses.

Typical methods available to insert faults into the equipment include
disconnecting leads to simulate opens, grounding pins to simulate
shorts, inserting known faulty parts, removing circuit cards or
wires, and replacing a good part, circuit, or assembly with an
identical item known to possess a particular type of failure. The
appropriate mix of these or other fault insertion techniques to be
used in a testability demonstration depends upon the specific design.

Even  with  a  reasonably  large  sample  of  inserted  faults,  a  demon¬ 
stration  can  yield  only  limited  data  on  actual  test  effectiveness. 
However,  a  demonstration  is  also  useful  in  validating  some  of  the 


assumptions  and  models  used  during  the  earlier  testability  analy¬ 
sis  and  prediction  efforts  (task  203)  which  may  have  been  based 
upon  a  much  larger  fault  set.  If  certain  assumptions  or  models 
are  invalidated  by  the  demonstration,  appropriate  portions  of 
task  203  should  be  repeated  and  new  predictions  should  be 
made. 

For  fault  tolerant  systems,  it  is  recommended  that  the  scope  of 
the  testability  demonstration  be  expanded  or  integrated  with  a 
fault  tolerance  verification  test.  The  purpose  of  the  test  is  to 
evaluate  and  demonstrate  how  well  the  system  fault  tolerant 
design  meets  requirements  with  respect  to  fault  protection  cover¬ 
age,  fault  recovery  time,  fault  types  to  be  tolerated,  maximum 
allowable  missing  data,  maximum  allowable  corruption  of  data  and 
false  alarm  constraints.  The  CDRL  for  the  testability/fault  tol¬ 
erance  demonstration  plan  should  require  identification  of  the 
analysis,  simulation  and  testing  procedures  required  to  develop 
objective  evaluation  and  acceptance  criteria  for  the  system's
testability  and  fault  tolerant  design  features.
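
The difficulty of demonstrating a low false alarm rate, and the limited information obtained from a finite sample of inserted faults, can both be illustrated numerically. The sketch below is not a MIL-STD-781 or MIL-STD-471A test plan. It assumes a constant false alarm rate (so that the number of false alarms in a given time is Poisson distributed), independent inserted faults, and a demonstration in which no false alarm occurs and every inserted fault is detected. The function names and numbers are hypothetical.

```python
import math


def zero_false_alarm_test_hours(max_rate_per_hr: float,
                                confidence: float) -> float:
    """Test time needed, with zero false alarms observed, to claim the
    false alarm rate is below max_rate_per_hr at the given confidence.

    Assumption: P(no false alarm in T hours) = exp(-rate * T).
    """
    return -math.log(1.0 - confidence) / max_rate_per_hr


def coverage_lower_bound_all_detected(n_inserted: int,
                                      confidence: float) -> float:
    """One-sided lower bound on detection probability when all
    n_inserted faults were detected (solves p**n = 1 - confidence)."""
    return (1.0 - confidence) ** (1.0 / n_inserted)


if __name__ == "__main__":
    # Hypothetical requirement: fewer than 1 false alarm per 1000 hours.
    print(round(zero_false_alarm_test_hours(0.001, 0.90)))            # 2303
    # Hypothetical demonstration: 50 inserted faults, all detected.
    print(round(coverage_lower_bound_all_detected(50, 0.90), 3))      # 0.955
```

Under these assumptions, roughly 2300 failure-free test hours are needed to support a rate of one false alarm per 1000 hours at 90 percent confidence, and 50 detected faults out of 50 support a detection probability of only about 0.955. Both results are consistent with the reliance on analytical techniques described above.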

For  fault  tolerant  systems,  the  above  Testability  tasks  should  be 
coordinated  with  associated  Reliability,  Maintainability,  Logistic  Support, 
and  System  Safety  tasks  and  analyses. 
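
The Task 203 analysis of the predicted failure population can be sketched as a failure-rate-weighted calculation of BIT fault detection and fault isolation levels. This is a minimal illustration, not the method prescribed by MIL-STD-2165. The record layout, field names, failure modes and rates below are hypothetical stand-ins for FMEA output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str                  # hypothetical FMEA failure mode label
    rate_per_1e6_hr: float     # predicted failure rate
    detected_by_bit: bool      # True if BIT is predicted to detect it
    ambiguity_group_size: int  # replaceable items BIT isolates to (0 if undetected)


def fraction_detected(modes):
    """Failure-rate-weighted fraction of faults detected by BIT."""
    total = sum(m.rate_per_1e6_hr for m in modes)
    detected = sum(m.rate_per_1e6_hr for m in modes if m.detected_by_bit)
    return detected / total


def fraction_isolated(modes, max_group_size):
    """Of the detected faults, the rate-weighted fraction isolated to
    max_group_size or fewer replaceable items."""
    detected = [m for m in modes if m.detected_by_bit]
    total = sum(m.rate_per_1e6_hr for m in detected)
    isolated = sum(m.rate_per_1e6_hr for m in detected
                   if m.ambiguity_group_size <= max_group_size)
    return isolated / total


if __name__ == "__main__":
    population = [
        FailureMode("processor halt", 50.0, True, 1),
        FailureMode("bus driver stuck", 30.0, True, 3),
        FailureMode("standby channel latent fault", 15.0, False, 0),
        FailureMode("power supply out of tolerance", 5.0, True, 1),
    ]
    print(round(fraction_detected(population), 3))      # 0.85
    print(round(fraction_isolated(population, 1), 3))   # 0.647
```

The resulting figures would then be compared with the specified requirements. An undetected failure mode in a standby element, as in the third hypothetical entry, is the kind of latent fault that independent testing of redundant circuitry is intended to expose.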

2.1.4  Software  Program  Tailoring 

Software is a major system development driving element. Because of
the importance of software in successfully attaining system
performance, fault detection, fault isolation and reconfiguration,
the systems manager must plan, organize, and control the software
project. Although software program tailoring is implemented the same
way for a fault tolerant system as for any other system, this section
has been included due to the importance of software in the system
life cycle. DoD-STD-2167 contains requirements for the development of
mission-critical computer system software. It establishes a uniform
software development process which is applicable throughout the
system life cycle. It incorporates practices which have been
demonstrated to be cost-effective from a life cycle perspective,
based on information gathered by the DoD and industry. Essential
software development process activities that must be considered
include the following:

•  Project organization and planning with special emphasis on the
Software Development Plan

•  Resource estimation and allocation including cost, schedule, and
staff

•  Required document preparation and delivery

•  Project monitoring and control

•  Independent review and assessment of design

•  Test and certification.

2.1.4.1  Management  Organization  and  Planning  Considerations  -  The  key
to  successful  software  management  is  a  clear  understanding  of  the  scope 
of  the  project  and  early  emphasis  directed  at  clarifying  the  require¬ 
ments,  the  deliverables,  and  the  organizational  framework.  Proper  at¬ 
tention  to  these  areas  will  ensure  the  system  manager  controls  the  salient 
elements  that  affect  project  planning.  Critical  questions  the  manager
must  address  include:

•  What  functions  must  the  system  perform? 

•  With  what  other  systems  will  this  system  interact? 

•  What  documents,  programs  and  files  are  specified  as  deliverable 
products? 

•  What  criteria  will  be  used  to  judge  the  acceptability  of  the  final 
product? 

•  What is the procedure for incorporating requirements changes
that affect the scope of the work?

•  Who are key contact people from the customer, developer, and
support groups?

•  Do different groups understand their areas of project
responsibility?

•  Where will the development work be done?

•  Which development computers will be used?

2.1.4.2  The  Software  Development/Management  Plan  -  The  software
development/management  plan  provides  a  disciplined  approach  to  organiz¬ 
ing  and  managing  the  software  project.  A  successful  plan  provides  a 
structured  checklist  of  important  questions,  consistent  documentation  for 
project  organization,  a  baseline  reference  with  which  to  compare  actual 
project  performance  and  experiences,  and  a  detailed  explanation  of  the 
management  approach  to  be  used. 

By  completing  the  plan  early  in  the  program,  the  manager  becomes 
familiar  with  the  essential  steps  for  organizing  the  development  effort, 
e.g.,  estimating  resources,  establishing  schedules,  assembling  a  staff, 
and  setting  milestones.

2.1.4.3  Resource  Estimation  and  Allocation  -  Two  of  the  most  critical
resources  are  development  staff  and  time.  The  software  manager  is  con¬ 
cerned  with  how  much  time  will  be  required  to  complete  the  project  and 
what  staffing  level  will  be  necessary  over  the  development  cycle. 

Table  2-4  is  provided  to  give  insight  into  a  typical  distribution  of 
schedule  and  personnel  effort  in  generic  terms. 


TABLE 2-4. Typical Distribution of Software Development Schedule &
Personnel Effort By Phase.

PHASE                    PERCENT OF TIME SCHEDULE    PERCENT OF EFFORT
Requirements Analysis                5                       6
Preliminary Design                  10                       8
Detailed Design                     15                      16
Implementation                      40                      45
System Testing                      20                      20
Acceptance Testing                  10                       5
                                   ---                     ---
                                   100                     100
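
As a worked illustration of how the schedule column of Table 2-4 might be applied, the sketch below spreads a total development schedule across the phases. It reads the schedule percentages as 5, 10, 15, 40, 20 and 10, which sum to the stated total of 100. The 80-week total and the function name are hypothetical, and the phases are assumed to run strictly one after another with no overlap.

```python
# Schedule percentages by phase, read from the time-schedule column
# of Table 2-4 (assumed reading; they sum to 100).
SCHEDULE_PERCENT = {
    "Requirements Analysis": 5,
    "Preliminary Design": 10,
    "Detailed Design": 15,
    "Implementation": 40,
    "System Testing": 20,
    "Acceptance Testing": 10,
}


def phase_schedule(total_weeks: float):
    """Return (phase, start_week, end_week) tuples for sequential phases."""
    if sum(SCHEDULE_PERCENT.values()) != 100:
        raise ValueError("schedule percentages must sum to 100")
    rows = []
    start = 0.0
    for phase, percent in SCHEDULE_PERCENT.items():
        duration = total_weeks * percent / 100
        rows.append((phase, start, start + duration))
        start += duration
    return rows


if __name__ == "__main__":
    # Hypothetical 80-week software development effort.
    for phase, start, end in phase_schedule(80):
        print(f"{phase:<22} week {start:5.1f} to {end:5.1f}")
```

A baseline of this kind gives the manager a reference against which actual phase completion dates can be compared.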


2.1.4.4  Required Document Preparation and Delivery - Documents and
deliverables provide an ongoing system description and serve as key
indicators of progress. They are a major concern of software managers
because they mark the transitions between life-cycle phases. Table 2-5
contains a list of DoD-STD-2167 documents and deliverables that are of
specific interest to the software manager. With the exception of the
Software Requirements Specification, all documents/deliverables
identified as requiring management emphasis are also listed in
DoD-STD-2167 as being the primary responsibility of management. The
Software Requirements Specification warrants management emphasis since
more than half of all software errors that occur are traceable back to
a misstatement of software requirements.


TABLE 2-5. DoD-STD-2167 Software Documentation Requirements Matrix.

                                                  PRIMARY         SOFTWARE MANAGEMENT
DOCUMENT                                          RESPONSIBILITY  EMPHASIS
System/Segment Specification *                    ENG(R)
Software Development Plan *                       MGT             X
Software Configuration Management Plan            MGT             X
Software Quality Evaluation Plan                  MGT             X
Software Requirements Specification *             ENG(R)          X
Interface Requirements Specification              ENG(R)
Software Standards and Procedures Manual          MGT             X
Software Top Level Design Document *              ENG(D)
Software Detailed Design Document *               ENG(D)
Interface Design Document                         ENG(D)
Database Design Document                          ENG(D)
Software Product Specification *                  ENG(D)
Version Description Document *                    ENG(D)
Software Test Plan *                              MGT             X
Software Test Description *                       ENG(T)
Software Test Procedure *                         ENG(T)
Software Test Report *                            ENG(T)
Computer System Operator's Manual                 SUP
Software User's Manual                            SUP
Computer System Diagnostic Manual                 SUP
Software Programmer's Manual                      SUP
Firmware Support Manual                           SUP
Operational Concept Document *                    ENG
Computer Resources Integrated Support Document    MGT
Configuration Management Plan                     MGT             X
Engineering Change Proposal *                     CDM
Specification Change Notice *                     CDM

LEGEND:
ENG(R)  Engineering Requirements     MGT  Management
ENG(D)  Engineering Design           CDM  Configuration Data Management
ENG(T)  Engineering Test             SUP  Suppliers of the Computer System
*       Document Usually Required    X    Management emphasis required


2.1.4.5  Project Monitoring and Controlling Tools - A tool in the
software environment is any instrument that supports the software
production effort. For example, the software development/management
plan, the cost estimation procedure, and the project notebook can be
classified as management tools. As a minimum these tools can be
utilized for software configuration management, project cost control
and a project histories data base.

2.1.4.6  Independent Review and Assessment of Design - In the current
era of software intensive weapon systems, success of the system
requires proper operation of both hardware and software. Discovering
errors early in the software life cycle yields a substantial cost
savings. Software errors uncovered during the design phase are 5 to 10
times less costly to correct than errors discovered during unit and
integration testing. The processes utilized to design and build high
reliability software are analogous in many ways to hardware
techniques. Areas of commonality include using skilled senior
personnel in high risk, critical areas, in-depth design reviews by
independent personnel, and extensive testing. It should be noted that
Software Quality Assurance activities provide an auditing function
(similar to Hardware Quality Assurance) and are not a substitute for
an independent design review.

2.1.4.7  Testing  and  Certification  -  Both  testing  and  certification  are
methods  used  to  ensure  quality  in  the  delivered  software.  Testing  iden¬ 
tifies  defects  so  the  software  can  be  revised  before  it  is  released.  Cer¬ 
tification  subjects  the  product  and  process  to  independent  inspection  and 
evaluation.  Certification  is  a  statement  that  some  requirement  has  been 
met  by  the  product  or  process. 

2.1.4.8  Software Development Guidelines for Testability - The
software which makes a design fault tolerant (error processing
routines, confidence tests, error detection/correction techniques,
etc.) may be contained in the operational and/or BIT software.
Guidelines for the preliminary design and analysis of software BIT can
be found in the Testability Program Application Guidance of
MIL-STD-2165, Appendix A, para. 50.6.7. In addition, memory sizing
approximations for error-correcting techniques or reconfiguration
strategies should include memory requirements for subroutines dealing
with equipment and personnel safety (operator alert and instruction)
in the event of certain failure modes. A list of these failure modes
may be acquired from the FMEA effort and is very helpful in pointing
out critically important areas of BIT diagnostic routines.

2.2  PROGRAM  PLANNING  AND  MANAGEMENT  CHECKLIST  QUESTIONS 

Unless specifically noted, the checklist questions apply both to AF
and contractor program managers. These questions are particularly
applicable at the SRR, PDR and CDR to supplement the R&M evaluation
criteria listed in MIL-STD-1521, Technical Reviews and Audits for
Systems, Equipment and Computer Software. Questions primarily
addressed to the Procuring Activity are followed by a (PA). Those
addressed to integrating or prime contractors are followed by a (C).

a.  Has  applicable  R/M/T  program  tailoring  and  application  guidance 
(see  para.  2.1.1,  2.1.2  and  2.1.3)  been  incorporated  in  the  SOW 
requirements  for  a  fault  tolerant  system  development  program? 
(PA) 

b.  Have  the  levels  of  analysis  and  schedule  for  reliability,  main¬ 
tainability,  testability,  safety  and  logistics  tasks  been  consis¬ 
tently  specified,  coordinated  and  integrated? 

c.  Does  the  requirement  to  develop  mission  reliability  models  for 
highly  fault  tolerant  systems  include  a  listing  of  existing  compu¬ 
terized  models  that  are  suitable  to  this  analysis?  (PA) 

d.  Has  a  requirement  for  the  identification  of  the  level  at  which 
faults  can  and  cannot  be  tolerated  been  specified? 

e.  Has  software  project  planning  been  accomplished? 

f.  Are  the  bidder's  software  development  plans  realistic  in  terms  of
the  size  of  development  staff  and  schedule?  (PA) 




g.  Are project history data bases available to assist managers in
assessing performance and recognizing problems?

h.  Has consideration been given to independent review and assessment
of high risk critical design areas?

i.  Do the bidder's plans reflect adequate resources allocated to
software testing? (PA)

NOTE:  Primary Responsibility Codes - (PA) = Procuring Activity
                                      (C) = Prime Contractor
                                      All others = Both


2.3  SPECIFICATION  OF  FAULT  TOLERANCE  AND  R/M/T  REQUIRE¬ 
MENTS 

Program managers must ensure that specification requirements for
fault tolerance are developed as soon as practicable, preferably
during the Concept Exploration phase. The requirements should be
further developed and refined during the Demonstration/Validation and
FSED phases to ensure that production hardware will contain the R/M/T
attributes necessary for the C3I application. Quantitative
requirements should be used in conjunction with specific R/M/T design
requirements to provide the necessary control in the system design
process. Before establishing R/M/T design specification requirements
for fault tolerance, the following factors must be considered:

•  System availability

•  Functional criticality

•  Acceptable degraded modes of operation

•  Inherent reliability of lowest level of functionally redundant
elements

•  Diagnostic capability commensurate with reconfiguration control

•  Testability of the function

•  Maintenance concept employed

•  System level quantitative R/M/T requirements.




Specific examples highlighting the interrelationship of these R/M/T
factors with fault tolerance are contained in Section 3.

2.3.1  System  Quantitative  R/M/T  Requirements 

There are two approaches to establishing qualitative and quantitative
fault tolerance requirements. The first approach (the classical
top-down) involves first establishing mission requirements and then
deriving fault tolerance requirements as a function of the mission,
restoration, and testability design characteristics. This approach is
appropriate for AF program managers who define requirements. The
second approach defines the lowest level of functional element
(bottom-up approach) and then establishes fault tolerance requirements
in relationship to the criticality of each system function. These
subsystem and lower level requirements must satisfy overall
allocations of system level fault tolerance requirements. This latter
approach is employed by contractors when the selected design of a
fault tolerant C3I system involves extensive use of off-the-shelf
equipment.

A  major  concern  of  both  approaches  is  to  achieve  high  system  readi¬ 
ness  (i.e.,  availability).  Quantitative  top-level  fault  tolerance  require¬ 
ments  should  be  derived  from  parametric  sensitivity  analyses  and  trade¬ 
offs  to  optimize  system  readiness.  The  process  of  establishing  and  later 
refining  these  top-level  fault  tolerance  requirements  during  the  design 
process  is  outlined  in  para.  5.1. 
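
A parametric sensitivity analysis of the kind referred to above can be sketched in a few lines. The model is an assumption made only for illustration and is not taken from this guide: steady-state availability is computed as MTBF / (MTBF + mean downtime), and mean downtime is a weighted mix of a short repair time when BIT isolates the fault and a longer one when manual troubleshooting is needed. All names and numbers are hypothetical.

```python
def mean_downtime_hr(bit_isolation_fraction: float,
                     t_bit_repair_hr: float,
                     t_manual_repair_hr: float) -> float:
    """Mean downtime per failure, weighted by how often BIT isolates it."""
    f = bit_isolation_fraction
    return f * t_bit_repair_hr + (1.0 - f) * t_manual_repair_hr


def availability(mtbf_hr: float, downtime_hr: float) -> float:
    """Assumed steady-state availability: uptime / (uptime + downtime)."""
    return mtbf_hr / (mtbf_hr + downtime_hr)


def sweep(mtbf_hr, t_bit_repair_hr, t_manual_repair_hr, fractions):
    """Availability at each candidate BIT fault isolation fraction."""
    return [
        (f, availability(mtbf_hr,
                         mean_downtime_hr(f, t_bit_repair_hr,
                                          t_manual_repair_hr)))
        for f in fractions
    ]


if __name__ == "__main__":
    # Hypothetical system: 500 hr MTBF, 0.5 hr repair when BIT isolates
    # the fault, 4 hr when manual troubleshooting is required.
    for f, a in sweep(500.0, 0.5, 4.0, [0.80, 0.90, 0.95, 0.99]):
        print(f"isolation fraction {f:.2f} -> availability {a:.5f}")
```

Sweeping one parameter while holding the others fixed shows how strongly readiness depends on it, which helps indicate where a tighter top-level requirement is worth its cost.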

The  subsections  that  follow  contain  specific  recommendations  useful 
in  developing  R/M/T  specification  requirements  for  fault  tolerant 
systems.  In  addition,  AF  program  managers  should  consider  the 
following  general  guidelines  when  deriving  R/M/T  system  specification 
requirements: 

•  Is the requirement overspecified? (Leading to higher development,
test and production costs)

•  Is the wording of the requirements subject to misinterpretation?

•  Is the requirement necessary, or is it included merely because of
previous usage?




•  Can  compliance  with  the  requirement  be  verified? 

•  Have  adequate  design  margins  (tolerance)  been  allowed? 

•  Has  tailoring  been  considered  for  all  referenced  standards? 

The  exact  method  of  specifying  R/M/T  depends  on  the  equipment/ 
system  that  is  being  developed  and  its  ultimate  application.  The  custom¬ 
ary  language  used  in  system  specifications  must  be  supplemented  when 
specifying  the  R/M/T  of  fault  tolerant  C3I  systems.  Guidance  in  speci¬ 
fying  R/M/T  requirements  for  these  fault  tolerant  systems  is  provided  in 
the  following  sections. 

2.3.1.1 Reliability/Fault Protection Coverage Requirements - Fault tolerant
systems will continue to function in the presence of faults or errors
within  the  system.  These  faults  and  errors  may  result  in  no  loss,  par¬ 
tial  loss,  or  complete  loss  of  system  functions.  Partial  loss  of  system 
functions  can  result  in  varying  levels  of  degraded  system  performance. 

During  the  Concept  Exploration  and  Demonstration/Validation  phases, 
program  managers  must  consider  the  permissible  level  of  system  perfor¬ 
mance  degradation  that  can  be  tolerated  without  compromising  mission 
success.  Based  upon  these  findings,  satisfactory  system  performance 
can  be  defined.  This  definition  of  satisfactory  system  performance  is 
then  included  in  and  keyed  to  the  Reliability  Requirements  Section  of  the 
C3I  System  Specification.  If  the  actual  C3I  system  operating  modes  are 
known  during  the  Concept  Exploration  and  Demonstration/Validation 
phases,  they  should  be  substituted,  as  applicable,  in  lieu  of  system  per¬ 
formance  levels  when  defining  satisfactory  system  operation. 

The  following  should  be  considered  by  AF  program  managers  in 
preparing  reliability  requirement  inputs  to  fault  tolerant  C3I  system 
specifications: 

a.  Quantitative  mission  reliability 

b.  Quantitative  maintenance  frequency  reliability 

c.  Description  of  storage,  transportation,  operation  and  maintenance 
environments 




d.  Time  measure  or  mission  profile 

e.  Definition  of  satisfactory  and  acceptable  degraded  system  per¬ 
formance 

f.  Tolerable  failure  policy  (Single-point  failure,  fail-safe,  etc.) 

g.  Failure  independence 

h.  Critical  mission  definition. 

Items  a  thru  e  above  are  the  normally  specified  reliability  inputs  to 
system,  prime-item  development  and  lower-tier  development  specifications. 
Paragraphs  6.2  and  12.3.1  of  MIL-HDBK-338  provide  guidance  and 
examples  for  preparing  these  reliability  specification  inputs.  Items  f 
thru  h  are  additional  recommended  specification  inputs  for  fault  tolerant 
systems. 

Item b, the quantitative maintenance frequency reliability, is
specified in operational/field terms (e.g., Mean-Time-Between-Maintenance-
Action (MTBMA) and Mean-Time-Between-Maintenance-Inherent (MTBMI))
in major system specifications in accordance with Department of Defense
Directive 5000.40, Reliability and Maintainability. Maintenance frequency
reliability may be specified in operational/field and/or contractual
terms (e.g., Mean-Time-Between-Failure (MTBF)) in lower-tier
development and equipment specifications. Operational/field requirements
relate  to  maintenance  organization  needs  for  field  equipment  and  are 
based  on  the  performance  of  existing  systems  and  a  validated  degree  of 
design  and  technology  improvement  that  can  be  provided  at  a  reasonable 
cost.  Contractual  requirements  are  based  on  the  inherent  design  charac¬ 
teristics  and  are  related  to  the  mission  needs  of  the  operating  organiza¬ 
tion.  The  terms  MTBF  and  mean-time-to-repair  (MTTR)  are  exclusively 
contract  terms  and  are  often  verified  by  R&M  demonstration  tests. 

The  difference  between  operational/field  and  contractual  require¬ 
ments  is  described  in  DoD  Directive  5000.40,  Reliability  and  Maintainabil¬ 
ity,  and  in  Reference  2,  and  is  best  illustrated  by  comparing  the  param¬ 
eters  MTBMA  and  MTBF.  MTBF  is  calculated  using  equipment  operating 




time and chargeable failures (which exclude, e.g., induced failures,
no-defect actions, and minor corrosion maintenance actions). The MTBMA
of the same equipment might be calculated using a different time base
(e.g., flight time) and would include maintenance events such as induced
failures, no-defect actions and minor corrective maintenance actions.
The specification of operational/field R&M requirements as being separate
and distinct from contractual requirements is not unique to fault
tolerant systems. However, program managers of fault tolerant systems are
advised to pay particular attention to this distinction in view of the
emphasis placed on meeting numerical R&M requirements. In general, AF
program managers should ensure the following when specifying
maintenance frequency reliability:

• Operational/field terms to be distinguished from contractual terms

• Numerical traceability from operational/field terms to contractual
terms

• Consistency established and maintained between operational/field
and contractual requirements.
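
The difference between the two measures can be sketched numerically. The example below is only an illustration: the event categories, the class and function names, and the hours are hypothetical, and the counting rules simply follow the description above (MTBF counts chargeable failures against operating time, while MTBMA counts all maintenance events against a different time base such as flight time).

```python
from dataclasses import dataclass


@dataclass
class MaintenanceEvent:
    kind: str  # "inherent", "induced", or "no_defect" (hypothetical labels)


def mtbf(operating_hours, events):
    """Contractual term: operating time per chargeable (inherent) failure."""
    chargeable = sum(1 for e in events if e.kind == "inherent")
    return operating_hours / chargeable if chargeable else float("inf")


def mtbma(flight_hours, events):
    """Operational/field term: flight time per maintenance action of any kind."""
    return flight_hours / len(events) if events else float("inf")


if __name__ == "__main__":
    # Hypothetical field data for one equipment item.
    events = (
        [MaintenanceEvent("inherent")] * 4
        + [MaintenanceEvent("induced")] * 3
        + [MaintenanceEvent("no_defect")] * 5
    )
    print("MTBF  (h):", mtbf(1200.0, events))
    print("MTBMA (h):", mtbma(800.0, events))
```

Keeping both calculations side by side, with the same event log as input, is one way to maintain the numerical traceability between operational/field and contractual terms called for above.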

Item  e,  the  definition  of  satisfactory  and  acceptable  degraded  sys¬ 
tem  performance,  applies  to  quantitative  mission  reliability  which  may  be 
expressed  in  terms  of  mission-time-between-critical-failure  (MTBCF)  or 
probability  of  mission  success  (R^).  In  some  cases  the  definition  of 
system  failure  may  be  preferable  to  specifying  the  definition  of  satisfac¬ 
tory  performance.  Or,  depending  on  the  situation,  including  both  de¬ 
finitions  may  be  useful.  Program  managers  should  emphasize  two  objec¬ 
tives  in  developing  a  definition  of  satisfactory  and  acceptable  degraded 
system  performance.  The  first  objective  is  to  remove  any  ambiguity  from 
the  interpretation  of  quantitative  reliability  requirements  and  their  meth¬ 
od  of  verification.  Secondly,  by  properly  defining  an  acceptable  level  of 
degraded  performance,  a  design  containing  unnecessary  system  complex¬ 
ity  may  be  avoided. 




A clear, unequivocal definition of "failure" must be established for
the equipment or system relative to its important performance
parameters. Successful system (or equipment) performance must be defined
and expressed in terms which will be measurable during the demonstration
test. Parameter measurements during the demonstration tests usually
include both go/no-go performance attributes and variable performance
characteristics. Since fault tolerant systems are often designed to
degrade gracefully (see para. 4.1.7), the limits of acceptable performance,
which are usually set at levels below which a mission may be degraded
beyond an acceptable level, should be established prior to testing.

Failures of go/no-go performance attributes such as channel switching,
target acquisition, target classification, etc., are relatively easy to
define and measure to provide a yes/no decision boundary. Failure of a
variable  performance  characteristic,  on  the  other  hand,  is  more  difficult 
to  define  in  relation  to  the  specific  limits  beyond  which  system  perfor¬ 
mance  is  considered  unsatisfactory. 

Figure 2-2 illustrates the two types of performance characteristics
and corresponding success/failure (yes/no) decision boundaries that
might be applied to a track radar or to a missile active seeker
(guidance) system. In both cases, the success/failure boundary must be
determined for each essential system performance characteristic measured in
the demonstration test. They must be defined in clear, unequivocal
terms. This will minimize the chance for subjective interpretation of
failure definition, and post-test rationalization (other than legitimate
diagnosis) of observed failures.
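
One way to remove subjectivity is to record the decision boundaries in machine-checkable form before the test begins. The following is a minimal sketch under assumed names and limits: the characteristic names, the tracking-error limits and the `evaluate_test` function are all hypothetical and are not drawn from Figure 2-2.

```python
GO_NO_GO = "go/no-go"

# Hypothetical boundaries, fixed before testing starts.
SPEC = {
    "target_acquisition": GO_NO_GO,
    "channel_switching": GO_NO_GO,
    "track_error_mrad": (0.0, 2.0),  # variable characteristic: (lower, upper)
}


def evaluate_test(spec, observations):
    """Return the names of characteristics that fall outside their boundary."""
    failures = []
    for name, limit in spec.items():
        value = observations[name]
        if limit == GO_NO_GO:
            passed = bool(value)            # attribute either worked or did not
        else:
            lower, upper = limit
            passed = lower <= value <= upper  # variable must stay within limits
        if not passed:
            failures.append(name)
    return failures


if __name__ == "__main__":
    observed = {
        "target_acquisition": True,
        "channel_switching": True,
        "track_error_mrad": 2.6,
    }
    print("Failed characteristics:", evaluate_test(SPEC, observed))
```

Because the limits are written down ahead of time, an observed value just outside a boundary is scored as a failure without post-test debate.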

The  criticality  of  the  C3I  system  or  certain  of  its  functions  often 
dictates  that  design  requirements  be  set  forth  for  tolerable  failure  policy 
and  failure  independence.  Therefore,  consideration  should  be  given  to 
specifying  fail-safe/fail-operational  design  and  prohibiting  single-point 
failures. 


Figure 2-2. Verification of System Performance Characteristics.


Failure independence requirements may stipulate fault containment
or fault propagation restrictions to limit both the immediate effects of
faults and possible secondary failure effects. When specifying a
tolerable failure policy or failure independence requirements, be sure to
include the equipment level to which the requirement applies. For
example, if redundant subsystems were used, faults would be tolerated at the
subsystem level. The system level is above that at which the faults
would be tolerated, while the assembly or card level would be below that
at which faults would be tolerated.

C3I systems may contain many operating modes and functions, some
of which are used in peacetime and some in wartime. In such cases, it
is  recommended  that  a  critical  mission  capability  (that  is  tied  to  an  es¬ 
sential  mission  performance  level)  be  defined.  This  definition  could  then 
be  related  to  quantitative  reliability  and  availability  requirements  and 
their  respective  demonstrations/verifications. 

Figure  2-3  provides  two  examples  of  the  reliability  specification  of 
fault  tolerant  C3I  systems.  The  first  example  is  of  a  C3I  data  fusion 
system  made  up  of  existing  off-the-shelf  computers  and  other  equipment. 
The second is a fault tolerant flight control computer used on a C3I
platform.


2.3.1.2 Fault Protection Coverage - Fault protection coverage is a
concept that can be stated in both quantitative and qualitative terms. The
quantitative  statement  is  used  most  often  in  reliability  modeling  of  re- 
configurable  or  redundant  systems.  The  output  of  these  reliability  mod¬ 
els,  the  probability  of  system  success,  has  been  found  to  be  quite  sensi¬ 
tive  to  the  fault  protection  coverage  parameter.  In  its  quantitative 
sense,  fault  protection  coverage  is  the  conditional  probability  that  the 
system  successfully  recovers  when  a  specific  type  of  failure  has  occur¬ 
red.  What  constitutes  proper  recovery  is  a  direct  function  of  the  in- 




INDEPENDENCE OF FAILURE - The XYZ system shall be designed such that at
the ABC (i.e., subsystem, unit, assembly, LRU, SRU) level no failure shall
induce any other failure.

FAULT TOLERANCE REQUIREMENTS - The XYZ system shall provide fault
tolerance capability in compliance with the contractual statement of work,
and in accordance with the following criteria.

LEVEL OF FAULT TOLERANCE - This capability shall specify the system (or
subsystem, by function) response to any randomly occurring single fault,
or sequence of unrelated faults. Failure is defined as any situation,
detected by any means, in which the XYZ system does not meet
specification requirements during operation. The applicable levels for
single fault are:

a. Fail-Safe: XYZ system output data frozen or disabled - failure
annunciated - before any variable error exceeds 2x specified accuracy.

b. Fail-Operational: XYZ system output data continues uninterrupted -
status change annunciated - transient disturbances do not exceed 2x
specified accuracies.

The response of the Fail-Operational system to each subsequent fault in a
sequence shall result in no less than the following system state
(provided by redundancy or a degraded operating mode):

Fail-Operational/Fail-Safe - (System state complies with Fail-Safe
criteria)

FAULT PROTECTION COVERAGE - At the "FAIL-SAFE" level, the effectiveness
of the XYZ system shall not be less than (VV) percent for a two-hour
sortie.

EFFECTIVENESS OF FAULT TOLERANCE - A Failure Mode and Effects Analysis
(FMEA) shall be performed to determine the effectiveness of the fault
tolerant design. Component and functional area failure probabilities
shall be calculated for the XYZ system's MTBF and MTBCF.

Figure 2-3. Examples of Reliability Specification of Fault Tolerant Systems. (Sheet 2 of 2)

tended  criticality  of  the  application.  It  may  mean  merely  establishing  a 
workable  hardware  system  configuration  (such  as  communications  switch¬ 
ing  processors),  it  may  require  that  data  flow  not  be  interrupted  (such 
as  a  satellite  attitude  control  system  computer),  or  it  may  mean  error  free 
processing  (no  erroneous  results  are  output  from  the  processing  element). 
The formulation of the probability of recovery, i.e., establishing a
workable system configuration, can be illustrated if one considers the case of
a communications system containing a number of active and standby spare
processors. If one active processor fails, the probability of recovery
would be equal to the joint probability of correct fault detection,
correct fault isolation, switchover to a backup spare processor, and
successful restoration of operation by the backup spare (e.g., it
boots, loads from memory and resumes processing).
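
The joint probability described above can be sketched as a simple product, assuming each step's probability is conditional on the success of the preceding steps. The function name and the numerical values below are hypothetical and serve only to show the arithmetic.

```python
from math import prod


def recovery_probability(p_detect, p_isolate, p_switch, p_restore):
    """Joint probability that every recovery step succeeds.

    Each argument is assumed to be conditional on the success of the
    steps before it, so the joint probability is their product.
    """
    steps = (p_detect, p_isolate, p_switch, p_restore)
    if not all(0.0 <= p <= 1.0 for p in steps):
        raise ValueError("probabilities must lie between 0 and 1")
    return prod(steps)


if __name__ == "__main__":
    # Hypothetical step probabilities for a spare-processor switchover.
    coverage = recovery_probability(0.99, 0.98, 0.995, 0.999)
    print(f"Probability of recovery: {coverage:.4f}")
```

Note that the product is always lower than its weakest factor, so a single poorly performing step (for example, fault isolation) bounds the achievable coverage.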


A  second,  more  limited  quantitative  definition  of  fault  protection 
coverage  relates  to  the  probability  of  detecting  any  fault.  The  value  of 
fault  protection  coverage  can  be  determined  by  using  the  average  of  the 
coverages  for  all  possible  classes  of  failures  weighted  by  the  probability 
of  occurrence  of  each  fault  class. 
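
The weighted average can be sketched as follows. The fault classes, occurrence probabilities and per-class coverages shown are hypothetical; the weights are normalized in case the listed occurrence probabilities do not sum exactly to one.

```python
def weighted_coverage(fault_classes):
    """Average coverage weighted by each class's probability of occurrence.

    fault_classes maps a class name to (probability_of_occurrence, coverage).
    """
    total_weight = sum(p for p, _ in fault_classes.values())
    if total_weight <= 0.0:
        raise ValueError("at least one fault class must have nonzero probability")
    return sum(p * c for p, c in fault_classes.values()) / total_weight


if __name__ == "__main__":
    # Hypothetical occurrence probabilities and detection coverages.
    classes = {
        "permanent": (0.2, 0.99),
        "transient": (0.7, 0.95),
        "intermittent": (0.1, 0.80),
    }
    print(f"Overall fault protection coverage: {weighted_coverage(classes):.3f}")
```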




A  third,  more  limited,  quantitative  definition  of  fault  protection 
coverage  is  the  probability  that  a  particular  class  of  fault  is  successfully 
detected  before  a  complete  system  failure  occurs.  Fault  classes  include 
the  following:  latent,  permanent,  transient,  intermittent,  catastrophic, 
common  cause,  design,  and  single  point. 

The  qualitative  meaning  of  fault  protection  coverage  specifies  the 
types  of  errors  against  which  a  particular  redundancy  scheme  guards. 
For  example,  the  coverage  of  Hamming  single-error-correcting,  double¬ 
error-detecting  code  is  the  correction  of  all  single-bit  errors  in  a  code 
word, and the detection of all double bit errors and some multiple bit
errors.
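
The coverage of such a code can be seen in a small worked example. The sketch below is an illustrative extended Hamming code with 4 data bits and 4 check bits, not the code used by any particular C3I equipment; the bit layout and function names are assumptions made for the example.

```python
def encode(data):
    """Encode 4 data bits as an 8-bit SEC-DED word.

    Index 0 holds the overall parity bit; indices 1..7 follow the usual
    Hamming(7,4) layout with check bits at positions 1, 2 and 4.
    """
    d1, d2, d3, d4 = data
    word = [0, d1 ^ d2 ^ d4, d1 ^ d3 ^ d4, d1, d2 ^ d3 ^ d4, d2, d3, d4]
    word[0] = sum(word) % 2  # makes the parity of all 8 bits even
    return word


def decode(word):
    """Return (data, status); data is None when the word is uncorrectable."""
    word = list(word)
    syndrome = 0
    for position in range(1, 8):
        if word[position]:
            syndrome ^= position  # XOR of the positions that hold a 1
    odd_parity = sum(word) % 2 == 1
    if odd_parity:
        # Odd number of flipped bits: treated as a single error and corrected.
        # A syndrome of 0 means the overall parity bit itself was flipped.
        word[syndrome] ^= 1
        status = "corrected"
    elif syndrome:
        # Even number of flipped bits with a nonzero syndrome: detect only.
        return None, "double error detected"
    else:
        status = "ok"
    return [word[3], word[5], word[6], word[7]], status


if __name__ == "__main__":
    sent = encode([1, 0, 1, 1])
    one_flip = sent.copy()
    one_flip[5] ^= 1
    two_flips = one_flip.copy()
    two_flips[2] ^= 1
    print(decode(sent), decode(one_flip), decode(two_flips))
```

In this sketch three simultaneous bit errors produce odd parity and are therefore treated as a single error, which is why the coverage statement above claims detection of only some multiple bit errors.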

The specification of fault protection coverage can take many forms
starting with the top level system specification and working down to
lower level specifications. The top level system specifications usually
specify fault protection coverage as follows:

"FAULT PROTECTION COVERAGE - All fault classes for the XYZ
system shall be covered except for the following (e.g.):

1. Generic faults which affect all processor channels in an
identical manner

2. Multiple faults, i.e., faults which affect multiple processor
channels simultaneously

3. Faults which occur during reconfiguration"

In addition to, or in lieu of, the qualitative form of specifying fault
protection coverage, lower level prime item development/equipment
specifications may include a quantitative requirement for fault protection
coverage by taking the form:

"FAULT PROTECTION COVERAGE - The fault protection coverage
(FPC) of the XYZ subsystem shall not be less than xxx percent.
Fault protection coverage is the combination of the independent
probabilities of Fault Detection (FD), Fault Isolation (FI), and Fault
Recovery (FR) for all possible faults of the system."
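
A short sketch can show how such a quantitative requirement might be checked and why reliability models are sensitive to it. It assumes FPC is the product of the three independent probabilities, and it uses a deliberately simplified two-unit redundancy model (the system survives if both units work, or if one fails and the fault is covered). That model, the function names and the numbers are assumptions for illustration, not taken from this guide.

```python
def fault_protection_coverage(p_fd, p_fi, p_fr):
    """FPC as the product of independent detection, isolation and recovery."""
    return p_fd * p_fi * p_fr


def meets_requirement(fpc, required_percent):
    """Compare a computed FPC against a 'not less than xxx percent' clause."""
    return 100.0 * fpc >= required_percent


def duplex_reliability(unit_reliability, coverage):
    """Simplified two-unit redundant system with imperfect coverage.

    Success = both units work, or exactly one fails and the fault is covered.
    """
    r = unit_reliability
    return r * r + 2.0 * coverage * r * (1.0 - r)


if __name__ == "__main__":
    fpc = fault_protection_coverage(0.99, 0.98, 0.97)  # hypothetical values
    print(f"FPC = {100.0 * fpc:.1f} percent; meets 95 percent:",
          meets_requirement(fpc, 95.0))
    for c in (1.0, fpc, 0.90):
        print(f"coverage={c:.3f}  duplex reliability={duplex_reliability(0.95, c):.4f}")
```

With perfect coverage the expression reduces to the familiar 1 - (1 - R)^2 for two parallel units; any shortfall in coverage subtracts directly from the benefit of the redundancy.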




2.3.1.3 Maintainability/Testability Requirements - An excellent guide to
these requirements is provided in Appendix A, para. 40.1.1 of
MIL-STD-470A, Maintainability Program for Systems and Equipment, particularly
the portions that pertain to identifying and quantifying maintainability
needs. This material is a recommended reference before undertaking this
task. All operational and deployment constraints listed in para.
40.1.1.2 of MIL-STD-470A as fundamental to the user's needs are of
particular importance to a manager wrestling with redundancy versus
corrective maintenance tradeoffs.

Another  excellent  guide  is  provided  in  Appendix  A,  paras.  50.5.6 
and  50.5.7  of  MIL-STD-2165,  Testability  Program  for  Electronic  Systems 
and  Equipment.  These  guidelines  pertain  to  testability  requirements 
which  must  be  considered  for  inclusion  in  a  system  specification.  Figure 
5  of  that  Appendix  A  is  shown  here  as  Fig.  2-4  and  lists  13  model  require¬ 
ments  (a  thru  m)  for  system  testability.  Two  additional  recommended 
model requirements (n) and (o) are also listed in Fig. 2-4. In certain
C3I system applications a manual error recovery requirement (n) may be
a necessary addition to automatic error recovery (model requirement (l)).
Manual error recovery should make maximum use of the hardware
and software implemented for requirement (a), status monitoring, to alert
an operator or crew member to execute an error recovery action. Typical
operator actions may include manually switching to a backup operating
mode, correcting the error by replacing an easy-access, plug-in module,
or temporarily continuing system operation in a degraded operating
mode. Air Force and contractor program managers of fault tolerant system
development efforts should consider the following guidance applicable to
two (l and m) of these model requirements.

Automatic  error  recovery  methods  such  as  reconfiguration,  error 
correction  code,  checkpoint  rollback,  redundant  message  sending,  and/or 
retry  may  be  incorporated  in  fault  tolerant  designs.  It  is  important  that 
wherever possible, the specified requirement for automatic error
recovery (l) be coordinated with and make use of the planned hardware and




3.X.X Design for testability

a. Requirement for status monitoring.

b. Definition of failure modes, including interconnection
failures, specified to be the basis for test design.

c. Requirement for failure coverage (detection) using
full test resources.

d. Requirement for failure coverage using BIT.

e. Requirement for failure coverage using only the
monitoring of operational signals by BIT.

f. Requirement for maximum failure latency for BIT.

g. Requirement for maximum acceptable BIT false alarm
rate; definition of false alarm.

h. Requirement for fault isolation to a replaceable item
using BIT.

i. Requirement for fault isolation times.

j. Restrictions on BIT resources in terms of hardware size,
weight and power, memory size and test time.

k. Requirement for BIT hardware reliability.

l. Requirement for automatic error recovery.

m. Requirement for fault detection consistency between
hardware levels and maintenance levels.

*n. Requirement for manual error recovery.

*o. Requirement for the identification of the level for which
faults can and cannot be tolerated.

* Additional recommended requirements which are not presently included in
MIL-STD-2165.

Figure 2-4. Model Requirements for Testability in a System Specification.




software intended to fulfill requirements (d), (e) and (h). The
specification of automatic fault recovery methods should include the following
as applicable:

• Identification of the fault classes (see para. 2.3.1.2) to which
the particular recovery methods apply

•  Specific  maximum  allowable  recovery  time. 
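
The two items above can be captured as a simple record against which observed recovery events are checked. This is a minimal sketch: the `RecoveryRequirement` structure, the `check_recovery` function, the fault class names and the time limits are hypothetical and do not represent a prescribed specification format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryRequirement:
    method: str                # e.g., "retry" or "checkpoint rollback"
    fault_classes: frozenset   # fault classes the method is specified to cover
    max_recovery_s: float      # maximum allowable recovery time, in seconds


def check_recovery(requirement, fault_class, measured_s):
    """Score one observed recovery against the specified requirement."""
    if fault_class not in requirement.fault_classes:
        return "not applicable"
    return "pass" if measured_s <= requirement.max_recovery_s else "fail"


if __name__ == "__main__":
    # Hypothetical requirement: retry covers transient and intermittent faults.
    retry = RecoveryRequirement("retry", frozenset({"transient", "intermittent"}), 0.5)
    print(check_recovery(retry, "transient", 0.2))
    print(check_recovery(retry, "transient", 0.9))
    print(check_recovery(retry, "permanent", 0.2))
```

Stating the applicable fault classes explicitly keeps a recovery method from being scored against faults it was never intended to handle.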

Requirement  (m),  for  fault  detection  consistency  between  hardware  sizing 
and  partitioning  levels  vs.  maintenance  replacement  levels,  should  be 
specified  in  conjunction  with  requirement  (h),  the  requirement  for  fault 
isolation  to  a  replaceable  item  using  BIT.  This  recommended  practice 
will  aid  the  BIT  designer  in  a  clearer  understanding  of  the  replaceable 
unit  assembly  level  to  which  he  should  be  isolating  (e.g.,  subsystem, 
LRU,  SRU,  component,  etc.).  This  will  avoid  duplication  of  efforts  be¬ 
tween  the  BIT  and  ATE  programs. 

Table  2-6  presents  a  typical  format  covering  numerical  requirements 
(a),  (c),  (d),  (e),  (g),  (h)  and  (i)  of  Fig.  2-4  as  well  as  many  other 
testability and maintainability parameters of interest. This Notional
Diagnostic Performance Specification is recommended in Reference 3 to be
a  deliverable  item  after  both  the  Demonstration/Validation  phase  and  the 
Full-Scale  Development  phase.  By  accurately  quantifying  all  the  listed 
parameters  of  this  specification,  a  meaningful  assessment  can  be  made  of 
a  fault  tolerant  C3I  system's  testability  and  maintainability. 

2.3.2  Verification 

All  contractual  R/M/T  requirements  must  have  a  contractually  speci¬ 
fied  method  of  verifying  compliance.  There  are  several  measures  which 
quantify  the  numerical  R/M/T  requirements  in  both  contractual  and  oper¬ 
ational  terms.  These  must  be  distinguished  from  each  other  in  docu¬ 
menting  the  requirements  and  the  associated  verification  method. 

TABLE 2-6. Notional Diagnostic Performance Specification

Contractual specifications must delineate the analysis methods and
demonstration tests that must be performed to verify that the specified
requirement has been met. For demonstration tests, the specification
should define the following:

a. How will the equipment/system be tested?

Test conditions, environmental conditions, test measures,
length of test, equipment operating conditions, accept/reject
criteria, test reporting requirements, etc.

b. Who will actually perform the tests?

Contractor, Government, or independent organization

c. When will the tests be performed?

Development, production, or field operation phases

d. Where will the tests be performed?

Contractor's plant, Government organization, or field.




The planned R&M growth of the system, if any, must also be
considered and related to the schedule for the demonstrations. If analysis
shows meaningful R&M growth between the scheduled demonstration
periods and system maturity, consideration should be given to specifying
these initial quantitative R&M requirements in the System Specification.
It may be necessary to conduct several time-phased system level R&M
demonstrations at major program milestones of a C3I system development
effort. If meaningful R&M growth is expected during this period, the
AF program manager should consider specifying numerical R&M
requirements as part of R&M growth curves. These R&M growth curves should
be incorporated in the requirements section of the System Specification.

Fault  detection/isolation,  reconfigurability  and  self-healing  perfor¬ 
mance  as  well  as  the  maintainability/repair  philosophy  should  be  validated 
as  early  in  the  development  as  possible  in  order  to  demonstrate  fault 
protection  coverage.  Traditionally,  this  has  been  accomplished  at  the 
end  of  the  Full-Scale  Development  phase,  which  promotes  a  reluctance 
to  rectify  problems  because  both  the  contractor  and  the  procuring 
agency  are  anxious  to  begin  production  and  get  the  product  into  ser¬ 
vice.  Thus,  problems  such  as  excessive  false  alarms,  too  many  "cannot 
be  duplicated"  and  "retest  ok’s",  etc.  are  not  properly  resolved.  Pro¬ 
gram  managers  should  attempt  to  avoid  such  problems  by  validating  high 
risk  areas  early  in  the  development  phase  where  corrective  actions  have 
minor  impact  on  cost  or  schedule. 

Exhaustive  simulation  and  testing  should  be  accomplished  on  rep¬ 
resentative  high  risk  hardware  elements  as  early  as  possible  in  the  de¬ 
velopment  cycle.  It  is  important  to  cull  out  design  deficiencies  in  a 
planned  approach  so  that  modifications  and  changes  in  test  strategies  can 
be  implemented  while  the  design  is  still  in  its  infancy.  To  this  end,  AF 
program  managers  should  require  the  contractor  to  document  the  planned 
approach  for  evaluating  and  demonstrating  how  well  a  fault  tolerant  de¬ 
sign  meets  its  specified  fault  tolerance  goals  and  requirements.  This  is 
accomplished by including a requirement in the C3I system SOW for such
tests to be identified and described in the System Test Plan,
Qualification Test Plans, Engineering Development Test Plans, Testability
Demonstration Plan, and Reliability Development/Growth Test Plans, as
applicable.


Provisions  should  be  made  for  a  maturation  plan  for  activities  as 
each  program  phase  progresses.  The  plan  should  provide  for: 

• Comparative analysis between test methodologies and
Maintainability/Diagnostic philosophies of the proposed system and similar
systems already fielded

• A means to improve the proposed system by utilizing lessons
learned and deficiencies of prior generation systems

•  A  schedule  of  the  demonstration/validation  milestones  and  re¬ 
sources  required  to  perform  these  maturation  activities  (e.g., 
prime  hardware,  laboratory  facilities,  etc.) 

•  Establishment  of  a  testability  and  maintainability  performance  da¬ 
ta  collection  system 

•  Evaluation  of  false  alarms  and  false  removals  in  the  system's  ac¬ 
tual  or  simulated  environmental  profile  conditions 

•  Evaluation  of  diagnostic  support  equipment 

• Testability and maintainability maturation profiles, including
periodic summaries of performance throughout the development
cycle as well as the results of the verifications.

2.3.3  Warranties 

The inclusion of reliability improvement warranties (RIW) in
requests for proposals and production procurement contracts will be a
major contributor to the success of complex fault tolerant military hardware
programs. Initially, these warranties provide, prior to contract award,
a realistic basis for evaluating the reliability of the equipment proposed
by the seller. The procedures for implementing RIWs on fault tolerant
designs are similar to those used for non-fault tolerant system
procurements. However, the seller's response to, and especially the pricing of,
the warranty for a fault tolerant system will be a direct measure of the
seller's assessment of, and confidence in, the ability of the equipment to
meet the stringent R&M requirements imposed on fault tolerant systems.
Later, the RIW will provide the procuring activity with no-cost
engineering change proposals (ECPs) which will improve reliability and provide
higher system availability and operational readiness.

It  may  seem  incongruous  to  include  RIW  requirements  in  the  pro¬ 
curement  contracts  for  fault  tolerant  systems  considering  the  total  high 
reliability  of  such  systems.  But  it  must  be  realized  that  the  components 
of  the  system  have  finite  reliabilities  and  will  at  times  fail.  Therefore, 
it  is  mandatory  in  the  deployment  of  fault  tolerant  systems  that  those 
components  be  repaired  or  replaced  very  rapidly. 

To  accomplish  this,  a  fundamental  design  feature  of  fault  tolerant 
systems  should  include  extensive  internal  monitoring  and  self-testing  of 
each  major  component  of  the  system  during  operation.  In  systems  that 
perform  a  critical  function,  these  self-tests  are  performed  prior  to  ini¬ 
tiating  that  function.  If  a  failure  is  detected,  the  function  is  not  initi¬ 
ated  and  possibly  the  mission  aborted  or  vital  data  lost.  In  either  case 
the  detected  faults  are  recorded  for  display  to  the  line  maintenance 
crew.  Because  of  the  national  security  considerations  attendant  to  C3I 
systems,  every  effort  must  be  made  to  eliminate  unreliable  systems  com¬ 
ponents  which  could  reduce  operational  readiness  and  cause  excessive 
system  downtime.  For  these  reasons,  AF  program  managers  should  con¬ 
sider  including  RIW  requirements  in  requests  for  proposals  and  produc¬ 
tion  contracts  for  fault  tolerant  systems.  Typical  parameters  warrantied 
include  MTBF  and  BIT  false  alarm  rates. 

Competitive  bid  reliability  incentive  and  warranty  programs  motivate 
contractors  to  provide  equipments  with  the  highest  practical  reliability 
and  operational  readiness.  These  incentive  and  warranty  programs  focus 
on the contractor's essential tasks and responsibilities and the
Government's major concerns, viz., equipment reliability and operational
readiness. Further background and details of reliability warranties are
provided  in  the  Fault  Tolerant  Design  Implementation  Guide. 




2.4  SPECIFICATION  CHECKLIST  QUESTIONS 

The following questions are intended to ensure that program
managers incorporate appropriate fault tolerance requirements into system
specification documentation.

a.  What  are  the  overall  contractual  reliability,  maintainability  and 
availability  requirements?  How  do  fault  tolerance  requirements 
impact  the  overall  reliability,  maintainability,  and  availability  re¬ 
quirements? 

b.  Has  the  definition  of  satisfactory  system  performance  or  system 
failure  been  specified?  (PA) 

c.  Have  the  maximum  off-line  or  reconfiguration  time(s)  been  speci¬ 
fied  or  included  in  the  definition  of  satisfactory  performance? 
(PA) 

d.  Has  the  maximum  allowable  missing  data  been  specified? 

e.  Has  the  maximum  allowable  contamination  or  corruption  of  exist¬ 
ing  data  been  specified? 

f. Have the allowable fault propagation requirements been specified?

g.  What  is  the  tolerable  failure  policy?  (single-point,  fail-safe,  etc.) 
(PA) 

h.  Have  the  fault  classes  to  be  tolerated  been  specified?  (C) 

i.  What  is  the  level  of  fault  protection  coverage  required  for  the 
system?  (C) 

j.  Have  the  false  alarm  constraints  been  specified? 

k.  Will  the  fault  tolerance  policies  and  methodologies  be  among  the 
vital  functions  of  the  program  to  be  evaluated  and  verified? 

l.  How  will  the  fault  protection  mechanisms  be  demonstrated  or  vali¬ 
dated? 

m.  Under  what  environmental  conditions  must  the  system  be  oper¬ 
ated  and  maintained?  The  more  difficult  the  environment  for 
both  operating  and  replacing  of  an  item,  the  more  cost-effective 
redundancy  becomes. 

n.  How  critical  is  it  that  the  proposed  system  survive  the  effects  of 
natural  and  weapons  enhanced  radiation  environments?  (PA) 




o. What is the maximum allowable Mean Time to Restore the system?
As this time becomes shorter, the need for redundancy becomes greater,
particularly for those items within the system which have larger
Mean Time to Repair figures.

p.  What  similar  systems  already  developed  can  be  studied  to  extract 
some  of  the  specifications  required  and  cite  areas  for  improve¬ 
ment?  (PA) 

q. What functions in the system involve the most risk to mission
success if they were to fail? The greater the risk, the greater
the demand for redundancy. (C)

r.  Has  a  requirement  for  manual  error  recovery  been  properly 
specified  if  this  technique  is  to  be  used? 

s.  Has  the  level  at  which  faults  can  and  cannot  be  tolerated  been 
specified?  (PA) 

t.  Has  consideration  been  given  to  including  an  RIW  requirement  in 
the  RFP  for  the  production  phase  contract?  (PA) 




3 - RELATIONSHIP OF C3I FAULT TOLERANCE TO MISSION AND SAFETY CRITICALITY

Fault  tolerance  requirements  for  C3I  systems  are  established  to  as¬ 
sure  the  availability  of  critical  mission  functions  and  to  avoid  potential 
safety  hazards.  This  section  describes  the  methodology  used  to  identify 
mission  and  safety  critical  functions  of  complex  systems  and  establish 
their  fault  tolerance  requirements.  Presented  herein  are  several  exam¬ 
ples  of  fault  tolerant  design  approaches  used  in  C3I  systems.  These 
examples  illustrate  areas  where  fault  tolerant  designs  may  be  used  and 
where  the  mission  operational  benefits  can  be  derived. 

3.1  FORMULATION  OF  C3I  FAULT  TOLERANCE  REQUIREMENTS 

The  deterrence  of  nuclear  conflict,  control  of  forces  and  employment 
of  weapons  all  strongly  depend  on  C3I.  Because  of  this  dependence  and 
its importance, C3I systems must be designed to be fault tolerant in order
to become survivable and available. The level of fault tolerance depends
upon the operational mission, its relationship to national security
and  the  system  availability  and  safety  requirements.  Fault  tolerance 
must  be  judiciously  implemented  to  avoid  unnecessary  program  costs  and 
logistic  support  requirements  for  spares  and  maintenance  personnel. 

Fault  tolerance  requirements  are  normally  established  by  the  con¬ 
tractor  in  compliance  with  the  system  specification  and  are  used  by  de¬ 
signers  to  develop  subsystem  configurations.  Ultimate  AF  design  control 
of  this  process  is  exercised  by  approval  of  the  design  concept  at  PDR 
and  the  design  details  at  CDR. 




Figure  3-1  illustrates  how  the  mission  and  safety-critical  fault  toler¬ 
ance  requirements  are  established.  For  the  mission-related  require¬ 
ments,  the  various  functions  of  the  C3I  system  under  consideration  are 
identified and the consequences of the loss or degradation of each function
assessed. This evaluation considers the effect on the C3I system's
capability and on the overall C3I community, i.e., its impact on national
security, thereby permitting the establishment of functional criticality
prioritization and the cost-effective application of fault tolerance
requirements.

It  is  essential  that  AF  program  managers  assure  themselves  that  the 
contractor's  methodology  and  criticality  assessment  of  mission  functions 
are  correct,  since  this  assessment  forms  the  basis  for  major  program  ex¬ 
penditures  in  manpower,  equipment,  testing,  and  future  logistic  re¬ 
sources. 

The criticality of a C3I function is driven by its application. For
example, the ability to guide weapons has the highest functional criticality
of an airborne surveillance radar system. However, this capability is far
less critical to national security than the strategic missile detection
capability of an infrared (IR) sensor system aboard a space surveillance
system satellite. By establishing a hierarchy of criticality among C3I
functions, each system function can be ranked in terms of its overall C3I
military importance.

Applying  this  rationale,  an  IR  sensor  satellite  designed  to  provide 
early  warning  detection  of  hostile  strategic  missile  launches  requires 
higher  levels  of  fault  tolerance  than  a  satellite  designed  to  provide 
meteorological  information  for  use  in  guiding  troop  movements.  The  cost 
of  restoration  or  repair  can  influence  functional  criticality,  assuming  that 
the  function  loss  or  system  downtime  can  be  tolerated.  It  may  be  more 




Figure 3-1. Identification of Mission & Safety Critical
Fault Tolerance Requirements.

Flow blocks:
  PROGRAM REQUIREMENTS (SAFETY, WEIGHT/VOLUME, ILS, R/M/T) and MISSION SCENARIO
  IDENTIFY SYSTEM MISSION FUNCTIONS
  CONDUCT CRITICALITY ASSESSMENT OF MISSION FUNCTIONS
  DEVELOP BASELINE C3I FUNCTIONAL DESIGN CONFIGURATION
  DEVELOP ALTERNATE FAULT TOLERANT DESIGNS (REDUNDANCY, GRACEFUL
    DEGRADATION, RECONFIG STRATEGIES, ETC.)
  CONDUCT ASSESSMENT TO IDENTIFY SAFETY FAULT TOLERANCE REQMTS

CRITICALITY ASSESSMENT OF MISSION FUNCTIONS
  C3I MISSION FUNCTION:         DETECTION OF ENEMY MISSILE LAUNCH PLUMES
  EFFECT OF LOSS OF FUNCTION:   LOSS OF EARLY WARNING CAPABILITY FOR MISSILE ATTACK
  IMPACT ON NATIONAL SECURITY:  ENEMY FIRST STRIKE POSSIBLE WITH MAJOR LOSSES
  FUNCTIONAL CRITICALITY:       NATIONAL SECURITY

SAFETY ASSESSMENT
  C3I EQUIPMENT:                RADAR
  HAZARDOUS MATERIAL/OPERATION/ENVIRONMENT:
                                RF RADIATION EXPOSURE TO GROUND PERSONNEL DUE TO
                                INADVERTENT TURN-ON OF TRANSMITTER
  WORST POSSIBLE CONSEQUENCE:   FATAL
  HAZARD CRITICALITY:           CATASTROPHIC
  SAFETY DESIGN REQUIREMENTS:   • DUAL REDUNDANT WEIGHT-ON-WHEELS INTERLOCK
                                • THREE POWER CABLE SWITCHES FOR TURN-ON




cost-effective  to  add  an  additional  layer  of  redundancy  to  a  potentially 
weak  link  in  a  satellite  (e.g.,  battery,  sensor,  etc.)  than  to  run  the 
risk  of  the  satellite's  premature  failure. 

The  safety  related  fault  tolerance  requirements  are  established 
based  upon  analysis  of  the  system's  potential  hazards.  These  conditions 
can be determined by identifying all hazardous materials, the system's
anticipated  operational  use  and  the  natural  and  induced  environmental 
exposure.  A  safety  assessment  can  be  conducted,  as  illustrated  in  Fig. 
3-1,  to  establish  the  safety  design  requirements.  This  evaluation  method 
is  an  extension  of  the  efforts  described  in  the  preliminary  hazard  list 
(Task  201  of  MIL-STD-882)  and  is  performed  by  system  safety  engineers. 
The  safety  assessment  is  conducted  very  early  in  the  system  acquisition 
life  cycle  with  emphasis  on  identification  of  fault  tolerance  provisions  for 
hazardous  areas.  The  analyst  reviews  each  C3I  subsystem  or  equipment 
to  determine  if  potential  safety  hazards  can  occur  as  a  result  of  hazar¬ 
dous  material,  operational  use,  environment,  or  other  conditions.  A 
hazard  criticality  is  established  based  on  worst-case  conditions  and  the 
potential  for  personnel  injury  or  damage  to  the  system  using  the  follow¬ 
ing  definitions  from  MIL-STD-882: 


DESCRIPTION     CATEGORY   MISHAP DEFINITION

CATASTROPHIC    I          DEATH OR SYSTEM LOSS

CRITICAL        II         SEVERE INJURY, SEVERE OCCUPATIONAL
                           ILLNESS, OR MAJOR SYSTEM DAMAGE

MARGINAL        III        MINOR INJURY, MINOR OCCUPATIONAL ILLNESS,
                           OR MINOR SYSTEM DAMAGE

NEGLIGIBLE      IV         LESS THAN MINOR INJURY, OCCUPATIONAL
                           ILLNESS, OR SYSTEM DAMAGE

The  safety  engineer  will  then  establish  safety  design  criteria,  including 
fault  tolerance  provisions  that  are  based  on  the  hazard  severity,  a  quali¬ 
tative  assessment  of  the  hazard  probability  and  the  C3I  program  system 
safety  requirements. 

Air  Force  and  contractor  program  managers  should  carefully  assess 
the  contractor's  rationale  for  establishing  safety  related  fault  tolerance 


TABLE 3-1. Typical Functional Criticality Prioritization.

SYSTEM FUNCTION                        FUNCTIONAL CRITICALITY

WEAPON GUIDANCE                        1 (HIGHEST)
ATTACK CONTROL                         2
SYNTHETIC APERTURE RADAR IMAGERY       3
FIXED TARGET IDENTIFICATION            3
CLUTTER MAP                            3
SMALL AREA - TARGET CLASSIFICATION     3
ATTACK PLANNING                        4
SECTOR SEARCH                          5
WIDE AREA SURVEILLANCE                 6 (LOWEST)

requirements.  It  may  be  advisable  to  re-evaluate  the  C3I  program  sys¬ 
tem  safety  requirements  in  light  of  the  evaluation  results  so  that  the 
program  objectives  can  be  achieved  without  compromising  system  safety. 

Section  4  describes  the  fault  tolerance  design  options  that  can  be 
implemented  to  satisfy  the  established  fault  tolerance  requirements,  and 
summarizes  their  inherent  advantages  and  disadvantages.  Tradeoffs  of 
design  alternatives  are  contingent  on  optimizing  the  LCC  and  system  ef¬ 
fectiveness  as  described  in  Section  5. 

3.2 EXAMPLES OF TYPICAL C3I FAULT TOLERANCE APPLICATIONS

In this subsection, two types of fault tolerant systems are discussed: a
space surveillance system and an airborne radar system. These are
used to illustrate how various fault tolerance approaches can be applied
to effectively enhance C3I mission capabilities.

3.2.1  Space  Surveillance  System 

The  space  surveillance  system  is  responsible  for  the  early  detection 
and  tracking  of  strategic  missile  launches.  This  system  consists  of  a 
constellation  of  orbiting  satellites  with  IR  sensors  that  detect  and  track 




missile  plumes,  and  a  ground  segment  to  process  and  disseminate  the  data. 
As  illustrated  in  Fig.  3-2,  fault  tolerance  is  implemented  at  all  levels  of 
the  system  design. 

The  fault  tolerance  approach  for  the  space  surveillance  system  is 
based  on  two  major  considerations:  first,  the  safety  concerns  associated 
with  the  satellite  while  it  is  in  close  proximity  to  the  Shuttle  Orbiter; 
and  second,  the  mission  success  for  the  specified  life  of  the  satellite. 
MIL-STD-1574 and NASA publication NHB 1700.7A define the fault tolerance
requirements that assure the payload will operate safely during prelaunch,
launch and separation from the Shuttle Orbiter. Single fault
tolerance  is  required  for  critical  hazards,  while  double  and  triple  fault 
tolerance  or  inhibits  are  required  for  catastrophic  hazards.  The  re¬ 
quirements  for  mission  related  fault  tolerance  are  derived  from  the  re¬ 
liability,  global  coverage,  survivability  and  availability  requirements  con¬ 
tained  in  the  system  specification.  Therefore,  the  system  is  designed  to 
tolerate  equipment  failures  during  long  periods  of  on-orbit  operation  and 
employs  a  variety  of  fault  tolerance  techniques,  from  error-correcting 
codes  to  redundancy  of  the  satellites  themselves. 

The  space  surveillance  system  must  provide  continuous  global  cover¬ 
age  even  if  a  satellite  fails  or  is  disabled  due  to  an  enemy  attack.  The 
space  segment  of  the  system  consists  of  a  constellation  containing  redun¬ 
dant  operating  satellites.  This  extensive  fault  tolerance  approach  is  ap¬ 
propriate  because  of  the  system's  high  mission  criticality  and  the  time 
delay  that  would  be  incurred  to  launch  replacement  satellites  or  to  per¬ 
form  on-orbit  maintenance.  In  addition  to  satellite  redundancy,  indi¬ 
vidual  satellites  have  a  stringent  mission  success  probability  requirement 
which  necessitates  the  use  of  extensive  fault  tolerance.  Stringent  reli¬ 
ability  and  fault  tolerance  requirements  are  generally  considered  cost  ef¬ 
fective  for  space  vehicles  because  of  the  high  launch  and  on-orbit  repair 
costs.  The  program's  design  goal  is  that  all  faults  result  in  either  no 
system  degradation  or,  at  worst,  degraded  performance  that  would  per¬ 
mit  ground  intervention  to  restore  the  system  to  full  performance  capa¬ 
bility. 




Figure 3-2. Space Surveillance System Fault Tolerance.

Panel labels: SYSTEM LEVEL; GROUND STATIONS; CONSTELLATION

Fault tolerance features listed:
• AUTONOMOUS SATELLITE OPERATION WITH LOSS OF GROUND SEGMENT
• CONSTELLATION RECONFIGURABLE TO OPTIMIZE COVERAGE WHEN FAILURES OCCUR
• MULTIPLE GROUND STATIONS
• EACH GROUND PIXEL COVERED ...
• STEREO CAPABILITY PROVIDED ...
• RESOURCE SHARING AMONG ...


The satellite's mosaic focal plane IR sensor is a highly fault tolerant
static sensor containing thousands of mosaic IR detectors. Failures of
individual detectors are tolerated since they are masked by the large
number of operating detectors and by the data supplied by adjacent satellites
viewing the same target area. Although the loss of individual detectors
does not compromise system performance, the loss of blocks of
detectors would significantly impact the system's detection and tracking
capability. Therefore, fault tolerance design guidelines are established
to permit only random detector losses. The satellite's IR sensor configuration
is similar to that of phased array radars, which contain numerous
transmit/receive modules. Typically, 5 to 10% of these modules
can fail randomly before the radar performance degrades beyond its
effective use.

The  data  management  subsystem  contains  the  application  code  that 
controls  the  spacecraft  subsystems,  including  the  redundancy  manage¬ 
ment  functions.  Two  design  goals  are  established:  first,  complete  fault 
tolerance  for  single  faults;  and  second,  provide  a  subsystem  similar  to  a 
single-string  computer,  so  as  to  simplify  the  application  code  and  mini¬ 
mize  development  cost.  The  configuration  selected  consists  of  a  pool  of 
processors  from  which  six  are  used  to  form  two  voting  triads.  Process¬ 
ing  channels  in  each  of  the  triads  communicate  with  the  other  elements 
over  interchannel  buses.  In  this  manner,  data  are  exchanged  for  dis¬ 
tribution  or  voting  purposes.  The  hardware  automatically  handles  the 
protocols  required  for  these  data  transfers. 

If  a  processor  channel  fails,  that  failed  channel  is  removed  from  op¬ 
eration  by  the  two  remaining  channels.  A  new  channel  is  activated,  run 
through  self-test,  its  application  code  downloaded  from  mass  memory, 
and  synchronization  is  initiated  with  the  other  two  operational  channels. 

The  triple  modular  redundancy  (TMR)  concept  was  chosen  to  meet  a 
stringent time requirement for fault recovery. TMR offered the advantages
of a simpler operating system, reduced power consumption, and
assurance  of  single  fault  tolerance,  although  the  required  recovery  time 
could  also  have  been  achieved  with  dual  processor  pairs  having  hot 
backups.  Other  spacecraft  subsystems  incorporate  similar  levels  of  re¬ 
dundancy  and  graceful  degradation  designs  to  ensure  meeting  high 
mission  reliability  and  long  mission  life  requirements. 
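
The voting and channel replacement behavior described for the data management subsystem can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the flight design: the names `vote`, `Triad`, `healthy` and `stuck` are hypothetical, processors are modeled as plain functions, and a spare is assumed to pass self-test and code download instantly.

```python
from collections import Counter


def vote(outputs):
    """Return (majority value, indices of channels that disagree with it)."""
    value, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: more than one channel disagrees")
    dissenters = [i for i, out in enumerate(outputs) if out != value]
    return value, dissenters


class Triad:
    """Three active channels backed by a pool of spares (hypothetical model)."""

    def __init__(self, channels, spares):
        self.channels = list(channels)  # three callables acting as processors
        self.spares = list(spares)      # spare callables, assumed to pass self-test

    def step(self, x):
        outputs = [channel(x) for channel in self.channels]
        value, dissenters = vote(outputs)
        for i in dissenters:
            if self.spares:
                # The agreeing pair removes the failed channel; a spare replaces it.
                self.channels[i] = self.spares.pop(0)
        return value


def healthy(x):
    return 2 * x


def stuck(x):
    return -1  # simulated stuck-at fault


triad = Triad([healthy, stuck, healthy], spares=[healthy])
first = triad.step(21)   # fault is masked by the two healthy channels
second = triad.step(21)  # the spare has replaced the failed channel
```

The single fault is masked on the same step in which it is detected, which is the property that makes a voting triad attractive when the fault recovery time requirement is stringent.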

Fault  tolerance  requirements  for  the  ground  segment  are  much  less 
stringent  than  those  of  the  space  segment.  In  general,  when  a  failure 
is  detected,  the  maintenance  personnel  can  isolate  the  failure  to  a  line- 
replaceable  unit  (LRU),  replace  the  unit  with  a  spare  LRU  so  that  the 
system can resume operations, and repair the faulty unit at one of the
ground  depot  facilities. 

For  critical  command  and  control  functions,  a  fault  tolerant  redun¬ 
dant  equipment  approach  is  utilized.  As  an  example,  if  the  mission  mes¬ 
sage  processor  fails  in  the  fixed  ground  station,  the  backup  support 
processor  will  detect  the  critical  condition  using  a  timeout  mechanism  and 
then  assume  the  role  of  the  mission  message  processor.  The  watchdog 
timer,  shared  mass  storage,  and  all  mission  messages  received  by  both 
processors,  assure  a  minimum  loss  of  messages  to  users. 
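
The timeout-based takeover described above can be sketched as follows. This is an illustrative model only: the class name `BackupProcessor`, the heartbeat interface, and the idea that the backup is told when a message has been distributed are assumptions made for the example, and timestamps are passed in explicitly so that the behavior is deterministic.

```python
class BackupProcessor:
    """Backup that takes over when the primary's heartbeat times out."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_heartbeat = 0.0
        self.role = "backup"
        self.pending = []  # messages received but not yet confirmed as distributed

    def on_heartbeat(self, now):
        # The primary resets the watchdog by reporting in.
        self.last_heartbeat = now

    def on_message(self, msg):
        # Both processors receive every mission message.
        self.pending.append(msg)

    def on_distributed(self, msg):
        # Assumed confirmation that the primary has delivered the message.
        self.pending.remove(msg)

    def check(self, now):
        """Return the messages to resend if the watchdog has expired."""
        if self.role == "backup" and now - self.last_heartbeat > self.timeout_s:
            self.role = "primary"
            to_send, self.pending = self.pending, []
            return to_send
        return []


backup = BackupProcessor(timeout_s=2.0)
backup.on_heartbeat(0.0)
backup.on_message("m1")
backup.on_distributed("m1")
backup.on_message("m2")          # primary fails before distributing m2
early = backup.check(1.0)        # watchdog not yet expired
recovered = backup.check(3.5)    # watchdog expired, backup takes over
```

Because the backup already holds every undelivered message, only messages in flight at the moment of failure are at risk, which is consistent with the minimum message loss noted in the text.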

The  ground  segment  operation  utilizes  fixed  and  mobile  ground 
stations.  Since  both  stations  continuously  transmit  mission  messages  to 
all  users,  the  failure  of  either  station  does  not  result  in  the  loss  of 
transmission  capability.  Upon  failure,  a  second  (backup)  mobile  station 
is  immediately  activated  and  commences  message  distribution  to  restore 
the  multiple  source  of  mission  messages.  Because  of  the  ready  availabil¬ 
ity  of  these  backup  mobile  stations,  widespread  implementation  of  redun¬ 
dant  processors  within  each  station  is  not  cost-effective. 

3.2.2 Airborne Surveillance Radar System

In  this  example,  an  airborne  surveillance  radar  system  will  be  util¬ 
ized  to  illustrate  how  fault  tolerance  and  in-flight  maintenance  can  be 




used  to  achieve  high  system  availability  for  long  duration  missions,  and 
thereby  minimize  the  impact  on  life  cycle  cost.  The  system's  operational 
concept  and  interfaces  are  shown  in  Fig.  3-3.  Its  primary  mission  is  to 
locate fixed and moving enemy ground targets and provide near-real-time
weapon  guidance  information  to  aircraft  and  missiles.  In  conjunction 
with  other  C3I  assets,  this  system  is  also  used  to  neutralize  enemy 
forces  considered  to  be  an  immediate  threat.  Table  3-1  shows  the  rela¬ 
tive  criticality  of  the  various  system  functions  as  defined  in  the  system 
specification.  These  functional  criticalities  provided  guidance  to  the 
contractor  in  establishing  the  mission  fault  tolerance  design  priori¬ 
tization.  The  system  specification  also  listed  the  acceptable  degraded 
levels  of  system  performance.  When  coupled  with  system  reliability  mod¬ 
els  (see  para.  2.1.1),  the  functional  criticality  prioritization  was  useful 
in determining whether candidate designs met system reliability requirements.
The reliability analysis also considered the effect of added hardware
redundancy on overall system reliability and the probability of success
for the various functions.

Early  in  the  system's  design,  RF  radiation  exposure  to  personnel, 
aircraft  emergency  egress  capability  and  the  common  safety  hazards  as¬ 
sociated  with  the  operation  and  maintenance  of  electronic  equipment  are 
identified  as  the  major  system  safety  concerns.  These  concerns  are  con¬ 
sidered  less  important  in  establishing  the  system  fault  tolerance  require¬ 
ments  when  compared  to  the  mission  criticality  impact.  Safety  inter¬ 
locks,  overrides  and  egress  features  incorporated  to  assure  system  safe¬ 
ty  have  a  minimal  impact  on  the  design  configuration  and  a  negligible  ef¬ 
fect  on  the  acquisition  and  logistic  costs. 

In  this  example,  the  airborne  surveillance  radar  system  is  required 
to  operate  up  to  20  hours  in-flight.  To  achieve  high  system  availability, 
various  forms  of  fault  tolerance  are  incorporated  along  with  an  in-flight 
maintenance  repair  capability.  In-flight  maintenance  is  accomplished  at 
both  the  shop  replaceable  unit  (SRU)  and  LRU  levels.  The  level  of 
in-flight  maintenance  chosen  for  each  equipment  is  based  on  an  opti¬ 
mization  of  on-board  spare  requirements,  diagnostic  capability,  mainte¬ 
nance  personnel  workload,  and  the  overall  system  availability  require¬ 
ment.  A  one-hour  mean  repair  time  is  specified  for  the  SRU  level  and 
30  minutes  for  the  LRU  level. 
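
The effect of the two specified repair times on availability can be illustrated with the standard inherent availability relation, A = MTBF / (MTBF + MTTR). The sketch below is only an illustration: the 200-hour MTBF is an assumed figure chosen for the example and does not come from the system specification, and the function name is hypothetical.

```python
def inherent_availability(mtbf_hours, mttr_hours):
    """Inherent availability: fraction of time the equipment is operable."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


ASSUMED_MTBF_HOURS = 200.0  # illustrative value only, not from the source

a_lru = inherent_availability(ASSUMED_MTBF_HOURS, 0.5)  # 30-minute LRU repair
a_sru = inherent_availability(ASSUMED_MTBF_HOURS, 1.0)  # one-hour SRU repair
```

For equal failure rates, the shorter LRU repair time yields the higher availability, which is one of the factors weighed when choosing the in-flight maintenance level for each equipment.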

The  radar  antenna  of  the  airborne  surveillance  radar  system  is  not 
considered  a  candidate  for  in-flight  maintenance  since  it  is  located  ex¬ 
ternally  and  is  inaccessible  during  flight.  High  antenna  availability  is 
achieved  with  the  use  of  fault  tolerant  design  features  and  with  hard¬ 
ware  that  has  proven  reliability  (where  redundancy  applications  are  im¬ 
practical).  The  radar  antenna  aperture  contains  hundreds  of  array  ele¬ 
ments  that  are  electronically  controlled  to  their  commanded  angle  by  hun¬ 
dreds  of  phase  shifters  (2  elements  per  shifter). 

The  radar  system  performance  is  highly  tolerant  to  random  failures 
of array elements or phase shifters across the radar aperture. This
characteristic results in gradual, but acceptable, degradation of radar
performance and, thus, assures high availability. It permits the establishment
of a deferred maintenance approach (i.e., numerous missions
can be flown without the need to repair individual array elements or
phase shifters until the peak radiated sidelobes degrade beyond their
acceptable  limits).  The  array  is  also  mechanically  slewed  by  two  servo 
motors  that  provide  system  fault  tolerance.  In  the  event  that  one  motor 
fails  to  operate,  the  remaining  motor  rotates  the  antenna  at  reduced  slew 
rates.  An  inertial  measurement  unit  (IMU)  is  used  to  measure  the  an¬ 
tenna  location  and  is  critical  to  mission  success.  The  IMU  is  a  non- 
redundant  analog  device  that  has  demonstrated  an  excellent  field  reliabil¬ 
ity  record  in  similar  applications.  Alternate  redundant  design  approach¬ 
es  were  investigated  to  increase  the  fault  tolerance  of  the  IMU.  It  was 
concluded  that  the  additional  hardware  complexity  made  redundancy  im¬ 
practical  and  not  cost  effective.  Therefore,  a  decision  was  made  to  use 
a  non-redundant  IMU  configuration  and  tolerate  the  infrequent  system 
failures. 

The  radar  transmitters  of  the  airborne  radar  surveillance  system 
utilize  coolanol  liquid  to  safely  limit  equipment  temperatures.  Opening 
coolanol  lines  in-flight  is  not  recommended  from  either  a  maintenance  or  a 
safety point-of-view. This precludes the in-flight repair of the transmitters
and necessitates a fault tolerant approach for these relatively
high  failure  rate  equipments.  The  configuration  selected  contains  four 
transmitter  units,  all  of  which  are  required  for  full  mission  capability. 
However,  if  one  or  two  transmitters  should  fail,  acceptable  degraded 
mission  capability  still  remains  although  certain  enemy  targets  may  not  be 
detectable.  Two  active  radar  data  processors  provide  fault  tolerance  ca¬ 
pability.  In  the  event  of  a  processor  failure,  the  operating  unit  can 
process  all  the  radar  data,  but  at  a  reduced  data  rate. 
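
The benefit of accepting degraded operation with one or two failed transmitters can be quantified with a k-out-of-n calculation. The sketch below assumes independent transmitter failures and uses an illustrative per-transmitter mission reliability of 0.9; both the value and the function name `prob_at_least` are assumptions made for the example.

```python
from math import comb


def prob_at_least(k, n, r):
    """Probability that at least k of n independent units survive,
    each with reliability r (binomial model)."""
    return sum(comb(n, i) * r**i * (1 - r) ** (n - i) for i in range(k, n + 1))


R_TRANSMITTER = 0.9  # assumed per-transmitter mission reliability

full_capability = prob_at_least(4, 4, R_TRANSMITTER)     # all four operating
degraded_or_better = prob_at_least(2, 4, R_TRANSMITTER)  # at most two failed
```

With these assumed numbers, the probability of full capability is about 0.656, while the probability of retaining at least degraded capability is about 0.996.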

Other radar equipments are designed to accommodate in-flight repair.
The radar control unit, receivers, signal preprocessor, A/D converters
and data processors are all essentially non-redundant equipments
that can be easily repaired in-flight. The use of in-flight repair capability,
in lieu of equipment redundancy, has the advantages of lower
weight, volume and system complexity. This approach must be balanced
against the operator workload requirements to optimize the mix of fault
tolerance and in-flight repair and minimize LCC while achieving the
mission  objectives.  Diagnostic  routines  identify  the  failed  SRU  or  LRU. 
In  most  cases,  indicator  lamps  identify  the  failed  hardware.  Commonality 
of  replacement  modules  is  stressed  throughout  the  design  phase  to 
reduce  the  number  of  on-board  spares  required.  Accessibility  features 
are  incorporated  to  permit  direct  access  to  each  SRU  or  LRU  without 
prior  removal  of  other  components. 

3.3  FAULT  TOLERANCE  REQUIREMENTS  CHECKLIST 

Air  Force  and  contractor  program  managers  should  evaluate  the  ra¬ 
tionale  used  to  establish  fault  tolerance  requirements  by  using  the  fol¬ 
lowing  checklist: 

a.  Are  the  fault  tolerance  requirements  based  on  the  mission  and 
safety  critical  functional  requirements? 

b.  What  is  the  mission  criticality  (national  security,  critical,  essen¬ 
tial,  non-essential)  of  the  C3I  system?  Are  the  fault  tolerance 
requirements  appropriate?  (PA) 

c.  Does  the  system  have  multiple  missions  with  different  functional 
criticalities  that  require  different  fault  tolerance  requirements? 

d.  Are  the  fault  tolerance  requirements  for  safety  critical  functions 
adequate? 

e.  Are  the  overall  fault  tolerance  requirements  too  extensive  for  the 
system?  Can  they  be  reduced  to  save  program  cost  and  reduce 
the  logistic  requirements? 

f.  Are  the  fault  tolerance  requirements  consistent  with  the  expected 
operational  use? 

   • Is the normal system operation active or standby?
   • What is the intended utilization cycle of the system (8
     hours/day, 24 hours/day, continuous, on-demand)?
   • What critical system functions warrant continuous monitoring?
   • What system functions are normally active? What system
     functions are normally passive or operating in a standby
     mode?

g. Are the fault tolerance requirements appropriate for the operating
environments, i.e., post nuclear blast operation, airborne,
spaceborne, ground based, attended, unattended, etc.?




4  -  GUIDANCE  FOR  DESIGN  OF  FAULT  TOLERANCE 


Hardware  and  software  redundancy  techniques  constitute  design  op¬ 
tions  that  can  be  selectively  employed  to  satisfy  fault  tolerant  system  de¬ 
sign  objectives.  This  section  provides  an  overview  of  many  of  these 
techniques  and  summarizes  their  advantages,  disadvantages  and  R/M/T 
impacts.  The  increasingly  important  issues  of  fault  detection,  distrib¬ 
uted  processing  and  the  impact  of  switching  are  addressed.  Reference 
is  made  to  the  Fault  Tolerant  Design  Implementation  Guide  for  detailed 
information  pertaining  to  hardware  and  software  redundancy  techniques. 

4.1  HARDWARE  AND  SOFTWARE  FAULT  TOLERANCE  DESIGN  OPTIONS 
A  designer  may  choose  from  a  variety  of  fault  avoidance  and  fault 
tolerance  design  techniques  to  satisfy  a  system  reliability  or  availability 
requirement.  The  key  elements  of  fault  tolerance  and  fault  avoidance 
are  depicted  in  Fig.  4-1. 

Reliability  improvement  or  fault  avoidance  techniques  in  many  appli¬ 
cations  prove  to  be  the  least  expensive  approach  to  attaining  a  reliability 
goal  provided  they  are  introduced  early  in  the  design  process.  In  a 
simplex (non-redundant) system, these techniques are to:
• Obtain higher quality parts/components
• Increase design safety margins/parts derating
• Exercise error-reducing design practices, such as shielding and
  grounding
• Improve and control the operating environment through cooling,
  heating and isolation
• Improve user/operator proficiency.




Fault tolerance techniques are applied when the required reliability
cannot be obtained with a simplex system. Initially, hardware and software
redundancy are incorporated in the system design in order to maintain
system operation even if a fault has occurred. This redundancy
can  take  the  form  of  additional  hardware  components  or  the  use  of  tech¬ 
niques  that  serve  to  delay  processing  time.  Hardware  redundancy,  the 
most  familiar  form,  uses  on-line,  hardwired  or  off-line  components  con¬ 
figured  either  as  standby  or  spare  units.  Time  delay  techniques  are 
utilized  primarily  in  software  and  permit  retransmit,  recompute,  rollback 
or  retry  methods  of  system  operation. 
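
The rollback and retry methods mentioned above trade processing time for fault tolerance. A minimal sketch, assuming that a transient fault is reported as an exception and that the system state can be copied cheaply, is shown below; `TransientFault`, `run_with_rollback` and `flaky_update` are hypothetical names introduced for the example.

```python
import copy


class TransientFault(Exception):
    """Hypothetical exception standing in for a detected transient error."""


def run_with_rollback(state, operation, max_attempts=3):
    """Apply `operation` to a copy of `state`, retrying from the checkpoint
    whenever a transient fault is detected."""
    for attempt in range(1, max_attempts + 1):
        working = copy.deepcopy(state)  # `state` is the untouched checkpoint
        try:
            operation(working)
            return working, attempt
        except TransientFault:
            continue  # discard the partial update (rollback) and retry
    raise RuntimeError("fault persisted after all retries")


calls = {"n": 0}


def flaky_update(s):
    s["count"] += 1  # partial update made before the fault appears
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientFault("simulated transient upset")
    s["done"] = True


final, attempts = run_with_rollback({"count": 0, "done": False}, flaky_update)
```

The partial updates made by the two failed attempts never reach the returned state, at the cost of the extra time spent on the retries.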

In  general,  fault  tolerance  design  techniques  fall  into  two  categor¬ 
ies:  fault  masking  and  fault  reaction.  In  early  applications,  fault  mask¬ 
ing  utilized  multiple  hardware  redundancy  in  either  dual,  triple  or  qua¬ 
druple  circuit  configurations.  In  this  form,  the  functional  intercon¬ 
nections  remained  fixed  while  failures  consumed  the  components  until  all 
alternate  paths  were  exhausted.  Fault  detection  was  not  utilized  in  con¬ 
junction  with  hardware  redundancy,  and  no  intervention  was  made  from 
outside  the  circuit  to  enable  switching  or  reconfiguration.  Today,  these 
hardware  redundancy  techniques  are  still  employed  but  hardware/soft¬ 
ware  fault  masking  often  utilizes  fault  detection  to  initiate  system 
reconfiguration.  Switching  to  standby  or  spare  units  is  an  example  of 
hardware fault masking, whereas the use of an error detection and correction
code is an example of software fault masking.
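
As an illustration of how an error detection and correction code masks a fault, the sketch below implements the classic Hamming(7,4) code, which corrects any single bit error in a 7-bit word. The text does not name a specific code, so the choice of Hamming(7,4) and the function names are assumptions made for the example.

```python
def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit word laid out as p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers word positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers word positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers word positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(word):
    """Return (data bits, syndrome); a nonzero syndrome is the 1-based
    position of the single bit that was corrected."""
    c = list(word)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], syndrome


data = [1, 0, 1, 1]
codeword = hamming74_encode(data)
corrupted = list(codeword)
corrupted[4] ^= 1  # simulate a single bit upset at position 5
decoded, position = hamming74_decode(corrupted)
```

The consumer of `decoded` never sees the upset, which is the sense in which the fault is masked. Two or more simultaneous bit errors are beyond the capability of this particular code.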

In  all  cases,  failure  detection  is  the  initial  step  in  implementing 
fault  reaction  techniques.  Detection  alone  does  not  provide  fault  toler¬ 
ance  with  continued  system  operation.  The  fault  must  be  corrected  or 
the  operator  informed  so  an  alternate  means  of  operation  may  be  provid¬ 
ed.  The  fault  correction  techniques  or  fault  reaction  "strategies"  can  be 
categorized  in  two  forms:  masking  redundancy  or  dynamic  redundancy. 
Masking redundancy, in the fault reaction sense, uses both detection
and correction techniques. It is also considered "static" in that it
employs built-in hardware for detection, switching, and data error correction
and requires no interaction with equipment located outside the
subsystem or module. Dynamic redundancy techniques provide reconfiguration
of the remaining system elements around the failed element(s).
These rely on the ability to detect and isolate the failed element(s).

Some  of  the  more  commonly  used  hardware  implementations  for  mask¬ 
ing  and  dynamic  redundancy  are  discussed  in  paras.  4.1.1  through 
4.1.7.  Paragraph  4.1.8  discusses  failure  detection  as  part  of  software 
fault  tolerance.  Paragraph  4.1.9  presents  the  characteristics  of  error 
detection  and  correction  codes.  Paragraph  4.1.10  discusses  fault  toler¬ 
ant  design  implementation  in  distributed  processing  systems. 

4.1.1  Redundancy  Techniques 

In  reliability  engineering,  redundancy  is  the  design  technique  of 
providing  more  than  one  means  of  accomplishing  a  given  system  function; 
i.e.,  all  paths  must  fail  before  the  system  fails  to  perform  the  required 
function.  The  alternate  means  by  which  the  function  is  accomplished 
need  not  be  identical  to  the  primary  means.  Redundancy  is  implemented 
to  increase  the  probability  of  system  success  where  the  reliability  of  a 
nonredundant  design  is  inadequate  to  meet  the  mission  or  system  re¬ 
quirements.  The  NASA  Space  Shuttle  program  is  an  excellent  example  of 
the  extensive  use  of  redundancy  to  achieve  program  goals.  The  Shuttle 
uses  four  computers  which  are  configured  as  a  redundant  set  for  all 
critical  mission  phases,  and  a  fifth  computer  that  contains  a  backup 
flight  software  package  and  also  performs  non-critical  tasks.  The  con¬ 
figuration  is  similar  to  NMR/simplex  with  the  outputs  of  the  four  primary 
computers voted at the control actuators. Each primary computer monitors
the outputs of the other three primary computers, with the redundancy
management circuitry in each primary computer voting to remove the
faulty computer from service.

Often,  redundancy  is  implemented  to  provide  fault  tolerant  designs 
so  that  safety  requirements  can  be  met.  The  decision  to  use  redundant 
design  techniques  must  be  contingent  on  a  tradeoff  analysis  involving 
mission  effectiveness,  safety  and  cost,  since  additional  equipment  will 
increase  maintenance  expense.  Redundancy  may  be  the  only  available 
technique  after  reliability  improvement  techniques  (e.g.,  derating,  de¬ 
sign  simplification,  or  substitution  with  higher  quality  parts)  are  shown 
to  be  incapable  of  satisfying  program  requirements.  As  an  example,  in¬ 
corporating  redundant  elements  may  be  the  best  approach  for  meeting 
reliability  goals  for  high  earth  orbit,  long-mission-duration  satellites  for 
which  in-orbit  maintenance  is  not  feasible.  When  on-line  maintenance  is 
planned,  redundant  designs  permit  repair  of  failed  equipment  without 
loss  of  system  uptime.  Because  of  the  increase  in  system  complexity  and 
cost,  the  use  of  redundancy  with  an  on-line  maintenance  concept  is 
normally  limited  to  critical  applications. 

Redundancy  can  be  incorporated  at  various  assembly  levels,  as 
shown  by  the  examples  below. 


ASSEMBLY LEVEL     EXAMPLES

Part               Microelectronic circuit, transistor, relay contacts
Circuit            Flip-flops, logic array
Functional         Adders, counters
Subassembly        Arithmetic unit, memory, CPU
Equipment          Computer, gyro, accelerometer
Subsystem          Radar, communications
System             Reconnaissance spacecraft (constellation)



Incorporating high level active redundancy within VLSI and VHSIC
microcircuit chips is a significant advance in the tools available for fault
tolerant design. However, common mode failures, such as a hermetic
seal failure on a chip, can cause the loss of the entire chip's function.
Thus, common mode failures become even more significant in system
designs that rely on active redundancy within VLSI and VHSIC
microcircuit chips. Program managers should require that reliability analyses,
such as a failure mode and effects analysis (FMEA), be conducted early
in the design process to identify critical failure modes and potential
common mode failures. This helps to uncover any potentially serious design
problems in a timely manner.

The  inherent  reliability  estimates  of  the  lowest  level  functional  ele¬ 
ment  must  be  calculated  early  in  the  design  process.  These  estimates 
provide  the  essential  inputs  to  the  reliability  models  for  alternate  redun¬ 
dancy  configuration  candidates.  Reliability  analysis  using  these  models 
assists  in  reducing  the  number  of  candidate  redundancy  schemes  capable 
of  satisfying  the  system  reliability  requirement.  The  mathematical  models 
for  several  redundancy  configurations  are  included  in  the  Fault  Tolerant 
Design  Implementation  Guide. 

The  penalties  associated  with  the  application  of  redundancy  include 
increased  maintenance,  weight,  space  requirements,  complexity,  cost, 
spares,  and  time  to  design.  The  increase  in  complexity  results  in  the 
increased  frequency  of  unscheduled  maintenance.  Thus,  safety  and 
mission reliability are improved at the expense of components added to
the  maintenance  chain.  However,  the  increase  in  maintenance  may  be 
countered  by  introducing  reliability  and  maintainability  improvement  tech¬ 
niques,  such  as  modularity,  design  simplification,  component  derating, 
and  the  use  of  more  reliable  components. 




Air Force program managers should ensure that the SOW requires
the  performance  and  documentation  of  trade  studies  when  multiple  redun¬ 
dancy  strategies  are  being  considered.  The  tradeoff  process  will  help 
the  design  engineer  determine  the  most  effective  redundancy  alternative. 
In  the  tradeoff  process,  it  may  be  determined  that  adding  certain  types 
of  redundant  equipment  may  impact  the  cost  of  preventive  maintenance. 
The  cost  of  this  preventive  maintenance  may  become  a  significant  factor 
in the system's total LCC. Redundancy may be easily implemented if the
redundant  item  is  available;  may  be  very  feasible  if  the  redundant  item 
is  economical  when  compared  to  the  cost  of  redesign  alternatives;  may 
not  be  viable  if  the  item  is  extremely  costly  or  if  aircraft/spacecraft 
weight,  volume,  or  power  limitations  are  exceeded.  In  any  event,  the 
designer  should  consider  all  these  factors  when  using  redundancy  to  im¬ 
prove  the  reliability  of  critical  items  (of  low  reliability)  for  which  a  sin¬ 
gle  failure  can  cause  the  loss  of  a  system  or  a  major  function. 

Incorporating  redundancy  to  achieve  increased  reliability  requires 
an  effective  fault  detection  and  isolation  scheme.  Isolation  is  necessary 
to  prevent  failure  effects  from  adversely  affecting  other  parts  of  a  re¬ 
dundant  network.  For  example,  failed  data  processing  elements  must  be 
isolated,  erroneous  data  must  be  prevented  from  contaminating  data 
bases,  data  base  corruption  sources  must  be  identified,  and  provisions 
must  be  made  to  prevent  the  writing  of  illegal  codes  to  memory,  as  well 
as  the  writing  of  legal  but  incorrect  codes  to  memory.  Air  Force  pro¬ 
gram  managers  should  ensure  that  the  SOW  requires  a  FMECA  be  per¬ 
formed  at  a  sufficiently  low  (detailed)  level  to  uncover  any  susceptibility 
of  failure  propagation  in  redundant  designs. 

Testability must be considered when incorporating redundancy into
a design. In fact, some circuits may not be checkable prior to mission
start because of redundancy inclusion. Without an adequate functional
test prior to mission start, it may be possible to determine only that at
least one of the redundant circuits is functional. In this sense, pre-mission
failures could be masked by a redundant item, which defeats the purpose of
adding redundancy to improve mission reliability. If it cannot be determined
that each of the redundant elements is operational prior to mission start,
then the design must be questioned. Air Force program managers
should ensure that the Statement of Work and development specifications
adequately address BIT planning and inclusion of test points, etc. when
redundancy is anticipated in the system design.

Figure 4-2 presents a summary of several fault tolerant design options
with the associated R/M/T impacts and typical applications to current
and future C3I systems. Tables 4-1 and 4-2 summarize the characteristics
of some fault tolerant design options implemented in software.

4.1.2  Active  Redundancy 

Active  (parallel)  redundancy  is  a  design  technique  where  one  or 
more  continuously  energized  redundant  elements  are  added  to  the  basic 
system  so  that  the  function  continues  to  be  performed  as  long  as  one  el¬ 
ement  remains  operative. 

Simple  active  redundancy  is  configured  with  identical  redundant  ele¬ 
ments  having  the  same  failure  rate.  Active  redundancy  configurations 
also  include  parallel  redundant  elements  of  unequal  failure  rates  as  well 
as  series-parallel/parallel-series  redundant  elements.  These  and  other 
active  redundancy  configurations,  their  corresponding  mathematical  mod¬ 
els,  advantages  and  disadvantages  are  discussed  in  Section  7  of  MIL- 
HDBK-338  and  in  the  Fault  Tolerant  Design  Implementation  Guide. 
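
As a minimal sketch of the kind of model referred to above (it is not a reproduction of the MIL-HDBK-338 formulas), the following assumes independent elements with constant failure rates, so that each element has reliability exp(-rate x time) and the parallel group fails only when every element has failed. The function names and the numeric failure rates are illustrative assumptions.

```python
import math

def element_reliability(failure_rate, hours):
    """Reliability of one element under a constant failure rate (exponential model)."""
    return math.exp(-failure_rate * hours)

def active_parallel_reliability(failure_rates, hours):
    """Reliability of continuously energized parallel elements.

    Assumes independent failures; the group fails only if all elements fail.
    Unequal failure rates (dissimilar elements) are allowed.
    """
    prob_all_fail = 1.0
    for rate in failure_rates:
        prob_all_fail *= 1.0 - element_reliability(rate, hours)
    return 1.0 - prob_all_fail

if __name__ == "__main__":
    mission_hours = 1000.0  # hypothetical mission time
    simplex = active_parallel_reliability([1e-4], mission_hours)
    duplex = active_parallel_reliability([1e-4, 1e-4], mission_hours)
    dissimilar = active_parallel_reliability([1e-4, 3e-4], mission_hours)
    print(f"simplex={simplex:.4f} duplex={duplex:.4f} dissimilar={dissimilar:.4f}")
```

Under these assumptions, adding a second element always raises the computed probability of success. The independence assumption does not hold where common mode failures exist.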

Exercising  these  mathematical  models  will  establish  whether  a  re¬ 
quired  probability  of  mission  success  within  a  given  operating  time  can 
be  satisfied  through  a  selective  application  of  active  redundancy  config- 




FAULT TOLERANT DESIGN OPTIONS

NON-REDUNDANT (SIMPLEX)
  Description: Each and every unit depicted in the series chain is
    required for mission success.
  Reliability impact:
    • Unable to attain high system reliability for systems containing
      complex equipment or long duration operations
    • Acceptable systems reliability may be achieved with high
      reliability equipment and short operating times
  Maintainability impact:
    • Minimal spares and maintenance personnel required compared with
      redundant systems
  Testability impact:
    • Self check capability should be provided on a non-systems
      interrupt basis
    • Less complex fault detection/isolation compared to redundant
      systems
  Typical applications:
    • Low criticality applications or where repair can be rapidly
      accomplished to minimize downtime
    • Reliable equipment with short operating time
    • Systems with constraints in cost, weight, volume

ACTIVE (SIMILAR) REDUNDANCY
  Description: Consists of a number (n) of identical, continuously
    operating units, and only one is required for mission success.
  Reliability impact:
    • High systems reliability can be attained without systems
      interruption
    • Potential common failure mode (or threat) can impact all
      redundant units
  Maintainability impact:
    • Severe impact on spares and maintenance personnel since all units
      are operating continuously
  Testability impact:
    • Difficult to detect a fault in redundant elements without a
      redundancy management scheme such as comparison monitoring,
      voting, etc.
    • Self test capability should be provided for each redundant element
  Typical applications:
    • High criticality applications where repair cannot be accomplished
      and where systems operation cannot be interrupted
    • Computer processing, communications networks

ACTIVE (DISSIMILAR) REDUNDANCY
  Description: Continuously operating units have unequal failure
    rates (λ), and only one is required for mission success (same as
    active (similar) but non-identical units utilized).
  Reliability impact:
    • High systems reliability can be attained without systems
      interruption
    • Normally less susceptible to common failure mode or threat
      environment
  Maintainability impact:
    • Complication of sparing and maintenance of different units
  Testability impact:
    • Difficult to detect a fault in redundant elements without a
      redundancy management scheme such as comparison monitoring,
      voting, etc.
    • Self-test capability should be provided for each redundant element
    • Additional software testing required
  Typical applications:
    • High criticality applications where repair cannot be accomplished
      or where systems operation cannot be interrupted
    • Applications where concerns exist for common mode failure or
      threat environment

STANDBY (SIMILAR) REDUNDANCY
  Description: Consists of a single continuously operating primary
    unit, a number (n) of quiescent identical unit(s) and a switch. The
    quiescent/standby unit(s) are not operational until switched in upon
    failure of the primary unit. Only one unit is required for mission
    success.
  Reliability impact:
    • Very high systems reliability can be achieved compared to active
      redundancy if systems interrupt for standby unit "warm-up" and
      "switch-in" is acceptable
    • Potential common failure mode or threat can impact all redundant
      units
  Maintainability impact:
    • Minimal spares and maintenance personnel required for high
      reliable systems since standby units are non operative and are
      less likely to fail
  Testability impact:
    • Difficult to detect a fault in redundant elements without a
      redundancy management scheme such as comparison monitoring,
      voting, etc.
    • Self-test capability should be provided for each redundant element
  Typical applications:
    • High criticality applications where repair cannot be accomplished
      and where systems interrupt for "switch-in" is acceptable

Figure 4-2. Fault Tolerance Design Options. (Sheet 1 of 2)


FAULT TOLERANT DESIGN OPTIONS

STANDBY (DISSIMILAR) REDUNDANCY
  Description: The primary and standby units are dissimilar, having
    unequal failure rates (λ). Only one unit is required for mission
    success.
  Reliability impact:
    • Higher systems reliability can be achieved compared to active
      redundancy with systems interrupt for standby unit "warm-up" and
      "switch-in"
    • Normally less susceptible to common failure mode or threat
      environment
  Maintainability impact:
    • Complication of sparing and maintenance of different units
      compared with similar standby redundancy
  Testability impact:
    • Self-test capability should be provided for each redundant element
  Typical applications:
    • High criticality applications where repair cannot be accomplished
      and where systems interrupt for "switch-in" is acceptable
    • Applications where concerns exist for common mode failures or
      threat environment

VOTING REDUNDANCY
  Description: Element's output state is determined by state of
    majority of inputs determined by voter (V).
  Reliability impact:
    • Can provide a significant gain in system reliability for short
      mission durations
    • Potential common failure mode or threat can impact all redundant
      elements
    • Requires voter reliability significantly better than element
      reliability
    • System operation continues uninterrupted due to voting logic
      providing a high confidence of masking a single faulty element
  Maintainability impact:
    • Severe impact on spares and maintenance personnel since all units
      are operating continuously
  Testability impact:
    • Self test capability should be provided for each redundant element
  Typical applications:
    • High criticality applications where repair cannot be accomplished
      and where systems operation cannot be interrupted

HYBRID REDUNDANCY
  Description: Defective active unit(s) detected by voter (V) and
    replaced by (n) standby spare unit(s).
  Reliability impact:
    • Very high reliability can be achieved without systems
      interruption for very long mission durations
    • Provides high confidence in the continued ability to mask faults
      by replacing faulty "voted out" units
  Maintainability impact:
    • Severe impact on spares and maintenance personnel due to multiple
      active units
    • Ideal configuration for a deferred maintenance policy
  Testability impact:
    • Difficult to detect a latent fault in redundant elements without
      a redundancy management scheme
    • Self-test capability should be provided for each redundant element
  Typical applications:
    • High criticality applications normally of long mission duration
      where high confidence in the ability to mask faulty outputs is
      essential

K OF N CONFIGURATION
  Description: Of N identical active units, K units must function for
    mission success (e.g., 2 of 3, 3 of 4, etc.).
  Reliability impact:
    • High reliability can be achieved without systems interruption and
      with modest increase in system resources
  Maintainability impact:
    • Severe impact on spares and maintenance personnel since all units
      are continuously operating
  Testability impact:
    • Difficult to detect a fault in redundant elements without a
      redundancy management scheme such as comparison monitoring, etc.
    • Self-test capability should be provided for each redundant element
  Typical applications:
    • High criticality applications where repair cannot be accomplished
      and where system operation cannot be interrupted

GRACEFUL DEGRADATION
  Description: Acceptable degraded modes of operation and graceful
    degradation.
  Reliability impact:
    • Normally, systems with degraded modes or graceful degradation can
      achieve high reliability levels with minimal increases in
      hardware resources
  Maintainability impact:
    • Minimal spares and maintenance personnel required for high
      reliable systems compared with other redundancy techniques
    • Ideal configuration for a deferred maintenance policy
  Testability impact:
    • System should be designed to detect a threshold above the minimum
      acceptable performance level
  Typical applications:
    • Restricted to those technical areas where this approach is
      applicable, i.e., phased array radars, solar arrays, I.R.
      sensors, etc.

Figure 4-2. Fault Tolerance Design Options. (Sheet 2 of 2)


TABLE 4-1. (Title and most entries are illegible in the source copy.
Legible row labels include wrap-around, analytic redundancy, and totally
self-checking/fault-secure networks; the legible cell entries are ratings
of low, medium, and high.)

TABLE 4-2. Properties of Error Detection Codes.

CODE TYPE   DETECTION                          CORRECTION         COMPLEXITY

PARITY      Any single-bit error;              None               Low
            no double-bit errors;
            some multiple, adjacent,
            unidirectional errors

HAMMING     Any single-bit error;              Single bit         High
            any double-bit error

M-OF-N      Any single-bit error;              None               Medium
            (illegible) double-bit errors;
            any multiple adjacent
            unidirectional errors

AN          Any single-bit error               Single bit         Low

RESIDUE-M   Any single-bit error               Single bit         Medium

CYCLIC      Single-bit to multiple,            Single and         Medium to
            random bits;                       random multiple;   high
            burst errors                       single burst

urations. These configurations often differ in weight, volume, power, and
cost, as well as in maintenance frequency, maintainability, and testability.

Therefore,  AF  program  managers  should  insure  that  the  SOW  re¬ 
quires  the  development  of  accurate  reliability  models  so  that  comparisons 
and  tradeoffs  between  alternate  hardware  architectures  and  redundancy 
schemes  may  be  accomplished. 

4.1.3  Standby  Redundancy 

Standby redundancy is a design technique where an alternate redundant
means of performing the function is switched in when it is determined
that a failure has occurred in the primary element performing
the  function.  This  differs  from  active  redundancy  in  that  the  redundant 
unit(s)  (or  elements)  are  not  operating  until  switched  into  the  system  as 
a  substitute  for  the  failed  primary  unit.  Switching,  therefore,  is  always 
required  to  activate  standby  redundant  units. 

Standby  elements  are  less  susceptible  to  failure  since  they  are  not 
operating  until  switched  in.  Therefore,  when  compared  to  active  redun¬ 
dancy,  higher  systems  reliability  can  be  achieved  if  system  complexity 
and  systems  interrupt  due  to  warm-up  and  switching  time  penalties  are 
acceptable. Although only one redundant element is required to operate
in  the  system  for  mission  success,  self-test  capability  is  necessary  for 
all  elements  to  assure  fault  detection  capability. 

Standby  redundancy  may  be  implemented  at  various  assembly  levels, 
(e.g.,  part,  circuit,  functional,  sub-assembly,  equipment,  subsystem 
and  system).  However,  the  implementation  level  chosen  depends  to  a 
great  degree  on  an  analysis  of  the  switch  complexity  and  the  tradeoff 
conclusion. In addition to maintenance cost increases for repair of the
additional standby elements, the system probability of success of certain
standby redundant configurations may actually be less than that of a
single element. This results from the impact of the reliability of switching
or other peripheral devices needed to switch in the standby redundant
element(s). Care must be exercised to ensure that reliability gains
are not offset by increased failure rates due to switching devices, error
detectors and other peripheral devices needed to implement the standby
redundancy configurations.
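
The effect of switch reliability can be illustrated with a deliberately simplified model. The sketch below assumes one primary and one identical cold standby unit with a constant failure rate, a standby unit that cannot fail while unpowered, and switching/detection hardware treated as a series element whose failure causes loss of the function. These assumptions, the function names, and the numbers are illustrative only and are not taken from the Fault Tolerant Design Implementation Guide.

```python
import math

def single_unit_reliability(failure_rate, hours):
    """Reliability of one non-redundant unit (exponential model)."""
    return math.exp(-failure_rate * hours)

def standby_pair_reliability(failure_rate, hours, switch_reliability):
    """One primary plus one identical cold standby unit.

    Simplifying assumptions: the standby unit has zero failure rate while
    dormant, and the switch/detector is a series element with the given
    probability of working for the whole mission.
    """
    lt = failure_rate * hours
    # With ideal switching, the pair survives if at most one failure occurs.
    ideal_pair = math.exp(-lt) * (1.0 + lt)
    return switch_reliability * ideal_pair

if __name__ == "__main__":
    rate, hours = 1e-4, 1000.0  # hypothetical values
    print("single unit      :", round(single_unit_reliability(rate, hours), 4))
    for r_switch in (1.0, 0.99, 0.90):
        pair = standby_pair_reliability(rate, hours, r_switch)
        print(f"standby, switch={r_switch:.2f}:", round(pair, 4))
```

With these hypothetical numbers, a switch reliability of 0.99 still leaves the standby pair ahead of the single unit, while a switch reliability of 0.90 makes the redundant configuration worse than no redundancy at all.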

The  effectiveness  of  standby  redundant  configurations  is  enhanced 
since this configuration allows repair of the failed unit (while operation
with  the  good  unit  continues).  Through  continuous  or  comparative  moni¬ 
toring,  the  switchover  function  can  provide  an  indication  that  a  failure 
has  occurred  and  operation  continues  with  the  alternate  unit.  With  a 
positive  failure  indication,  delays  in  repair  can  be  minimized.  Ground- 
based  and  large  airborne  weapons  systems,  such  as  AWACS  and  Joint 
STARS,  are  examples  of  systems  that  utilize  on-line  repair  techniques  to 
enhance  availability. 

4.1.4  Voting  Redundancy 

Voting  redundancy  is  a  design  technique  in  which  the  element's 
output  state  is  determined  by  a  voter  or  comparator  that  compares  or 
analyzes  the  state  of  the  majority  of  the  inputs.  Faults  are  statically 
masked  in  voting  redundancy,  since  the  agreeing  outputs  are  selected  by 
the voter and the faulty outputs are ignored. Thus, the majority of
agreeing outputs (presumed to be good) allows continuation of the
element's intended function without interruption. Voting redundancy must
be configured with an odd number of elements to avoid the possibility of
tie-vote ambiguity. The minimum implementation, called triple modular
redundancy (TMR), has its voter output the result on which two or more of
the three elements agree. A more general implementation, N-modular
redundancy (NMR), outputs the majority of N element outputs that agree.
Voting  may  be  applied  to  analog  and  digital  signals  and  is  commonly  ap¬ 
plied  at  the  module  level. 
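
The voting operation described above can be sketched in a few lines. This is an illustrative model for digital (exact-match) outputs only; an analog voter would need a tolerance band or a mid-value select, which is not shown. The function names are hypothetical.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value produced by a strict majority of the elements.

    Raises ValueError when no value holds more than half of the votes.
    """
    if not outputs:
        raise ValueError("no outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if 2 * count <= len(outputs):
        raise ValueError("no majority among element outputs")
    return value

def disagreeing_elements(outputs):
    """Indices of elements whose output differs from the voted value."""
    voted = majority_vote(outputs)
    return [i for i, out in enumerate(outputs) if out != voted]

if __name__ == "__main__":
    tmr_outputs = [42, 42, 17]  # third element is faulty
    print(majority_vote(tmr_outputs), disagreeing_elements(tmr_outputs))
```

The list of disagreeing elements is not needed for static masking, but it is the information a redundancy management scheme would use for fault reporting.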




The penalty associated with N-modular redundancy includes N times
the basic hardware complexity (cost, weight and power), plus the
complexity of the voter. The voter may also cause a signal propagation
delay leading to a decrease in performance. To achieve the reliability
potential of NMR configurations it is important to prevent the voter from
becoming a single point failure. This can be accomplished by introducing
one or more redundancy techniques into the voter design.
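
The sensitivity to voter reliability can be seen in a simple TMR model. The sketch below assumes three independent, identical elements and a single non-redundant voter in series with them; the function name and the numbers are illustrative assumptions.

```python
def tmr_reliability(element_reliability, voter_reliability=1.0):
    """Reliability of a 2-of-3 majority system with one voter in series.

    Assumes independent, identical elements. The term 3r^2 - 2r^3 is the
    probability that at least two of the three elements are working.
    """
    r = element_reliability
    return voter_reliability * (3.0 * r**2 - 2.0 * r**3)

if __name__ == "__main__":
    for r_voter in (1.0, 0.99, 0.90):
        print(f"element=0.90 voter={r_voter:.2f} ->",
              round(tmr_reliability(0.90, r_voter), 4))
    # Below an element reliability of 0.5, TMR is worse than one element.
    print("element=0.40 ->", round(tmr_reliability(0.40), 4))
```

In this model a voter no better than the elements it serves (0.90) cancels the benefit of triplication. The final line shows that the gain is limited to missions short enough for element reliability to stay above 0.5.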

4.1.5  Hybrid  Redundancy 

Hybrid  redundancy  is  a  dynamic  redundancy  technique  in  which 
failed  NMR  modules  (see  para.  4.1.4)  are  replaced  with  previously 
unused  spare  modules.  When  the  voter  detects  a  disagreement  in  a  hy¬ 
brid  redundant  system,  the  module  or  modules  in  the  minority  are  con¬ 
sidered  to  be  failed  and  are  replaced  by  an  equivalent  number  of  spare 
modules.  Thus,  a  fault  occurring  in  a  TMR  configuration  results  in  the 
triad being reconfigured back to a state where it can once again mask
faults.  Hybrid  redundancy  overcomes  one  of  the  drawbacks  of  NMR 
since  the  fault  masking  capability  of  an  NMR  design  degrades  rapidly  as 
elements  fail  and  the  possibility  exists  for  a  collection  of  failed  elements 
to  out- vote  the  remaining  healthy  elements,  thereby  leading  to  premature 
system  failure.  Thus,  hybrid  redundancy  is  a  design  solution  to  meet 
stringent  system  reliability  requirements  of  uninterrupted  performance 
where  the  mission  duration  is  very  long  and  maintenance  is  not  possible. 
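
The replacement of voted-out modules can be sketched as a toy model. The class below is illustrative only: the module identifiers, the class name, and the policy of drawing the first available pooled spare are assumptions, and real implementations must also handle voter and switch failures.

```python
from collections import Counter

class HybridCore:
    """Toy model of an NMR core backed by a pool of spare modules."""

    def __init__(self, active_ids, spare_ids):
        self.active = list(active_ids)
        self.spares = list(spare_ids)
        self.retired = []

    def vote_and_reconfigure(self, outputs):
        """Vote on one set of outputs, then swap out disagreeing modules.

        `outputs` maps each active module id to its output value.
        Returns the voted (majority) value.
        """
        counts = Counter(outputs[m] for m in self.active)
        voted, count = counts.most_common(1)[0]
        if 2 * count <= len(self.active):
            raise RuntimeError("no majority: fault masking capability lost")
        for position, module in enumerate(self.active):
            # Minority modules are presumed failed and replaced while spares last.
            if outputs[module] != voted and self.spares:
                self.retired.append(module)
                self.active[position] = self.spares.pop(0)
        return voted

if __name__ == "__main__":
    core = HybridCore(["A", "B", "C"], ["S1", "S2"])
    print(core.vote_and_reconfigure({"A": 1, "B": 1, "C": 0}), core.active)
```

After the first disagreement the core is again a full triad (A, B, S1), so a second, later fault can still be masked.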

The  spare  modules  used  in  hybrid  redundancy  often  are  described 
as  pooled  spares  (i.e.,  they  are  not  dedicated  to  any  particular  module 
but  can  replace  any  module  when  called  upon).  Depending  on  the  appli¬ 
cation,  the  pooled  spares  can  be  cold,  hot,  or  flexed. 

Cold  spares  do  not  operate  until  they  are  switched  in.  Therefore, 
they  will  exhibit  a  lower  failure  rate  than  pooled  spares  that  are  powered 
(hot).  Consequently,  using  cold  pooled  spares  in  a  hybrid  redundancy 
configuration results in higher system reliability than can be obtained by
using hot spares. This approach often provides significant advantages
in situations of long duration missions without maintenance (e.g., satellite
application). It may also result in fewer spares, lower power requirements,
and reduced weight over a hot sparing strategy.

Hot  pooled  spares  are  modules  or  equipment  that  are  powered  and 
operating  in  a  slave  mode.  These  may  be  shadowing  the  operating  ele¬ 
ments  of  the  NMR  core,  but  their  output  is  not  being  voted  upon. 
Thus,  delay  time  (to  reconfigure)  is  minimized.  The  advantage  of  a  hot 
standby architecture (to mask failures) is that takeover by the slave is
virtually  instantaneous.  The  slave  needs  no  updates  because  it  is  doing 
the  same  tasks  as  the  master  NMR  core  elements.  Disadvantages  of  us¬ 
ing  hot  pooled  spares  are  increased  probability  of  failure  during  long 
duration  missions  and  increased  power  and  weight  required  for  a  given 
allocated  system  reliability.  In  many  applications  of  hybrid  redundancy, 
hot  standby  spares  may  be  inefficient  and  may  waste  resources,  since 
the  spares  are  dedicated  exclusively  to  the  functions  of  the  NMR  core. 
However,  where  the  penalty  of  failure  is  extreme,  such  as  those  affect¬ 
ing  national  security,  this  type  of  redundancy  may  be  appropriate. 

Flexed  spares  are  spare  elements  of  a  system  which  are  exercised 
periodically  and  systematically.  The  use  of  flexed  spares  reduces  the 
possibility  of  a  cold  spare  not  working  during  a  reconfiguration  attempt. 
For  maximum  effectiveness  and  confidence,  this  strategy  requires  that 
spare  buses,  modules,  voters,  power  supplies  and  clocks  be  periodically 
tested  during  the  mission. 

4.1.6  K  of  N  Configurations 

A  K  out  of  N  configuration  is  a  system  consisting  of  N  elements,  of 
which  at  least  K  elements  must  be  functioning  in  order  to  achieve  system 
mission  success.  All  N  elements  in  the  configuration  are  operating  in 
parallel,  similar  to  the  operation  of  a  system  configured  in  active  parallel 
redundancy (see para. 4.1.2). However, instead of requiring only one
of the N elements to function (as in active parallel redundancy), at least K
elements must function to attain system mission success. Examples of K
out of N configurations are:

•  Spacecraft attitude control thruster engines
•  Inertial reference assemblies
•  Triple modular redundancy (TMR).

In the first example above, a spacecraft may be designed such that its
attitude control is maintained with any 8 (or more) of 18 thrusters
functioning. The second example is an integrated inertial reference
assembly designed so that any 3 or more of 6 gyros and any 2 or more of
4 operational accelerometers will produce an accurate inertial reference
function. The last is an example of triple modular redundancy (see para.
4.1.4) in which any two or more of the three elements must function
(agree) in order for the system to function successfully.
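
For identical, independent elements, the probability of success of a K out of N configuration follows the binomial distribution. The sketch below is a minimal illustration under those assumptions; the element reliability of 0.9 is a hypothetical value, not data from the examples above.

```python
from math import comb

def k_of_n_reliability(k, n, element_reliability):
    """Probability that at least k of n independent, identical elements work."""
    r = element_reliability
    return sum(comb(n, i) * r**i * (1.0 - r)**(n - i) for i in range(k, n + 1))

if __name__ == "__main__":
    r = 0.9  # hypothetical element reliability over the mission
    print("8 of 18 thrusters:", k_of_n_reliability(8, 18, r))
    print("3 of 6 gyros     :", k_of_n_reliability(3, 6, r))
    print("2 of 3 (TMR)     :", k_of_n_reliability(2, 3, r))
```

Setting K equal to 1 reduces the expression to simple active parallel redundancy, and setting K equal to N reduces it to a series (non-redundant) chain.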

4.1.7  Graceful  Degradation 

Graceful  degradation  is  a  design  technique  which  utilizes  extra 
hardware  as  part  of  the  system's  normal  operating  resources  to  ensure, 
with  high  probability  of  success,  that  an  acceptable  (minimum)  perform¬ 
ance  level  can  be  maintained  in  the  presence  of  failures.  Therefore,  the 
extra  hardware  may  raise  system  performance  above  minimum  require¬ 
ments;  this  enhanced  performance  continues  as  long  as  the  extra  hard¬ 
ware  is  not  required  to  overcome  failure  effects.  Potential  failure  modes 
that  cause  only  a  partial  loss  of  functional  capability  may  require  lower 
levels  of  fault  tolerance,  thereby  reducing  hardware  complexity  and  the 
overall  system  cost.  The  extra  hardware  used  in  gracefully  degrading 
systems  differs  from  standby  redundant  and  hybrid  redundant  configura¬ 
tions  in  that  the  extra  hardware  contributes  to  normal  system  perform¬ 
ance  and  does  not  have  to  be  switched  in. 

Two  examples  of  gracefully  degrading  systems  are  large  C*l 
phased-array  radar  systems  and  distributed  processing  systems.  A 
phased-array  radar  antenna  typically  contains  a  large  number  of  trans¬ 
mitting  and  receiving  elements.  A  small  number  (typically  less  than  5%) 


of  randomly  ditpartad  failures  of  thasa  alamanta  hat  •  negligible  off  act 
on  system  performance,  and  additional  failures  can  be  compensated  for 
by  boosting  transmitter  power  or  receiver  gain.  An  even  larger  number 
(typically less than 10%) of random element failures might be offset
by  the  capability  of  the  surviving  elements  to  meet  minimum  acceptable 
system  performance  requirements  with  a  degraded  detection  capability  as 
illustrated  in  Figs.  4-2  and  4-3.  These  antennas  are  adaptable  to  a  de¬ 
ferred  maintenance  policy  wherein  failed  elements  need  not  be  repaired 
after  each  mission.  A  second  example  of  graceful  degradation  is  a  dis¬ 
tributed  data  processor  subsystem  in  which  the  network  contains  extra 
operating processors that provide additional throughput. If any processor
fails, only the excess capacity is lost. The number of extra processors
to be included in the network can be selected to yield an allocated probability
of maintaining at least minimal system functionality through the end
of the mission.

Figure 4-3. Graceful Degradation of Antenna Receive/Transmit (R/T) Modules (Typical Phased Array Radar).
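As an illustration only, the selection of the number of extra processors can be sketched with a simple binomial model. The sketch assumes that all processors are identical, fail independently, and each survive the mission with the same probability; the function names and the numerical values are hypothetical and are not taken from this Guide.

```python
from math import comb


def prob_at_least_k_of_n(k: int, n: int, r: float) -> float:
    """Probability that at least k of n independent elements survive,
    each with survival probability r (binomial model, an assumption)."""
    return sum(comb(n, i) * r**i * (1 - r) ** (n - i) for i in range(k, n + 1))


def extra_processors_needed(k_required: int, r: float, goal: float,
                            max_extra: int = 50) -> int:
    """Smallest number of extra processors such that at least k_required
    processors survive the mission with probability >= goal."""
    for extra in range(max_extra + 1):
        if prob_at_least_k_of_n(k_required, k_required + extra, r) >= goal:
            return extra
    raise ValueError("goal not reachable within max_extra processors")


if __name__ == "__main__":
    # Hypothetical case: 4 processors carry the minimum workload, each
    # processor has a 0.95 probability of surviving the mission.
    print(extra_processors_needed(k_required=4, r=0.95, goal=0.999))
```

A real allocation would also have to account for fault containment and for the reconfiguration time of the operating system, which this model ignores.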

Graceful  degradation  implies  that  element  failures  are  unlikely  to 
cause extensive secondary failures. Limiting secondary failures, i.e.,
fault containment, may require careful design of the interconnections
between adjacent elements, and between groups of adjacent elements, of a
phased array radar. Also, the data output of a failed data processor must be prevented from
contaminating  other  operating  elements.  The  AF  program  manager  should 
ensure  that  the  SOW  and  CDRL  require  that  an  FMEA  be  performed  at  a 
functional  or  hardware  level  to  indicate  the  consequences  of  element 
failure(s)  in  a  gracefully  degrading  system.  The  purpose  of  carefully 
selecting  the  level  of  detail  in  the  FMEA  is  to  highlight  the  susceptibility 
of  the  design  to  data  contamination  or  secondary  failure(s)  so  that  cor¬ 
rective  redesign  may  be  instituted. 

4.1.8  Fault  Detection  Techniques 

Many  methods  are  available  to  detect  hardware  failures  and  data  er¬ 
rors.  Most  have  been  conceived  to  satisfy  the  goals  of  specific  system 
types such as analog control, communications, and processing systems.
The different techniques used provide varying levels of three primary
characteristics: 

• Responsiveness - Time to detect

• Failure Source Isolation Level - Component, module, function or
system unit

• Implementation Complexity - Directly related to the cost to incorporate.

Most  highly  fault  tolerant  systems  use  a  combination  of  techniques.  Ta¬ 
ble  4-1  lists  the  common  methods  which  include  detection  approaches  for 
both  hardware  and  software  intensive  systems.  The  choice  of  a  specific 
detection  technique  depends  upon  the  nature  and  criticality  of  the  ele¬ 
ment  or  task.  The  cost  and  complexity  of  implementing  it  must  be  asses¬ 
sed  along  with  the  accuracy  of  the  method  used. 




4.1.9 Error Detection Codes

Systematic  coding  of  transmitted  data  is  the  method  most  often  used 
to  detect  errors  that  occur  in  digital  communication.  Errors  can  occur 
singly,  in  multiples  of  random  errors,  or  in  bursts  due  to  timing  incon¬ 
sistencies  or  noise  caused  by  electromagnetic  interference.  Distinct 
classes  of  codes  have  been  configured  to  deal  with  the  various  types  of 
errors  expected.  The  more  complex  error  patterns  demand  the  use  of 
more  sophisticated  error  detection  coding  techniques.  In  addition,  the 
advanced  detection  techniques  can  be  enlarged  and  designed  to  correct 
the  errors;  thus,  they  provide  a  form  of  masking  redundancy.  All 
codes  apply  data  redundancy  to  an  information  stream  that  is  prede¬ 
termined  and  consistent.  Error  correction  is  often  incorporated  in  these 
designs.  The  complexity  and  the  detection/correction  capabilities  of 
some  commonly  used  code  types  are  summarized  in  Table  4-2. 
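The difference between a code that only detects errors and one that can also correct them may be illustrated with a short sketch. The two codes chosen here, a single even-parity bit and a Hamming (7,4) code, are used as well-known textbook examples; they are an assumption of this sketch and are not necessarily the code types listed in Table 4-2. All function names are hypothetical.

```python
def add_even_parity(bits):
    """Append one parity bit so the total number of 1s is even."""
    return bits + [sum(bits) % 2]


def parity_ok(word):
    """True when the word still contains an even number of 1s."""
    return sum(word) % 2 == 0


def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit Hamming codeword.
    Positions are numbered 1..7; parity bits sit at positions 1, 2 and 4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_syndrome(word):
    """XOR of the positions of all 1 bits; 0 means no error was detected."""
    syndrome = 0
    for position, bit in enumerate(word, start=1):
        if bit:
            syndrome ^= position
    return syndrome


def hamming74_correct(word):
    """Correct a single-bit error. Returns (corrected word, error position)."""
    syndrome = hamming74_syndrome(word)
    fixed = list(word)
    if syndrome:
        fixed[syndrome - 1] ^= 1  # the syndrome names the bad position
    return fixed, syndrome


def hamming74_decode(word):
    """Extract the 4 data bits from a (corrected) codeword."""
    return [word[2], word[4], word[5], word[6]]
```

The parity bit detects any odd number of bit errors but cannot locate them, and it misses double errors. The Hamming code spends three redundant bits on four data bits and in return masks any single-bit error, which is the masking redundancy referred to above.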

4.1.10  Distributed  Processing 

Distributed  processing  allows  for  computational  functions  to  be  dis¬ 
persed  among  several  physical  computing  resources.  The  resources  may 
be  geographically  separated  or  co-located.  Computations  are  performed 
locally  but  the  processors  may  be  linked  to  permit  separate  tasks  to  be 
partitioned  between  computational  resources.  Three  important  aspects  of 
distributed  processing  systems  are  the  design  of  the  local  computational 
resources,  the  network  which  allows  the  processors  to  communicate  with 
one  another  and  the  operating  system  used  to  allocate  the  partitioning  of 
the  tasks  to  the  local  processing  elements. 

Distributed  processing  systems  are  one  of  the  most  implementable 
design  techniques  for  fault  tolerant  designs.  The  various  fault  tolerant 
hardware  and  software  techniques  identified  in  this  Guide  can  be  ap¬ 
plied  to  the  local  computational  resources.  The  network  requires  forms 
of  message  checking  and  redundant  communication  paths.  The  operating 
system, which is critical to the success of the system, should generally be
distributed and redundant to minimize the impact of a failed memory module
storing  the  program. 




Distributed  Processing  offers  many  advantages  over  other  process¬ 
ing  systems  including  the  centralized  approach.  These  advantages  in¬ 
clude  the  following  characteristics: 
• extendability,
• fault tolerance, and
• implementation attributes.

Extendability, sometimes referred to as modularity, flexibility or
adaptability, is the degree to which system functionality and performance
can  be  changed  without  changing  the  system  design.  The  major  benefits 
of  extendability  are  ease  of  growth  and  ease  of  modification.  A  high  de¬ 
gree  of  extendability  permits  performance  upgrades  in  small  increments 
at  correspondingly  small  cost  increases.  Reduced  hardware  and  software 
development  and  support  costs  will  be  achieved  by  commonality  of  system 
elements  such  as  nodal  data  processors  and  bus  control  units.  Dis¬ 
tributed  processing  systems'  fault  tolerance  is  enhanced  by  the  multiplic¬ 
ity  of  independent  processors  which  may  improve  fault  detection,  iso¬ 
lation  and  recovery  through  cooperation  of  the  processors.  Graceful 
degradation  is  easily  implemented  in  distributed  processing  systems, 
since  the  loss  of  a  single  processor  may  only  result  in  a  slight  incre¬ 
mental  decrease  in  performance  or  throughput.  Errors  occurring  in  a 
single  processor  are  confined  and  only  a  subset  of  system  functionality 
and  performance  may  be  affected.  Furthermore,  spare  redundant  pro¬ 
cessors  may  easily  be  connected  to  the  network  to  facilitate  meeting  a 
stringent  reliability/availability  requirement.  The  application  charac¬ 
teristics  of  distributed  processing  systems  are  concerned  with  attributes 
such as bandwidth, maturity and technology insertion. The system response
time and throughput are both improved by a multiplicity of processors
operating concurrently. There are also several cost-effectiveness
advantages for using an aggregate of interconnected smaller processors
instead of more traditional centralized systems of equivalent performance.
First, the quantity and functionality of the smaller processor's
logic is more amenable to high levels of semiconductor integration
than is that of the larger processor. Second, smaller processors can be
designed and implemented more quickly, so they can make use of the latest,
most cost-effective hardware technology. Finally, smaller processors
are manufactured in greater quantities and thus benefit from production
economies.

However,  there  are  limitations  and  design  issues  associated  with 
distributed  processing  systems.  These  include: 

• The amount of internal processing contained in a node must be
traded  in  the  design  phase  against  the  addition  of  more  computa¬ 
tional  nodes. 

• The nodes in a network can be interconnected in many ways
(fully  connected,  multiply  connected,  star,  ring,  tree,  etc.)  and 
these  must  be  traded  in  the  design  phase  against  connectivity 
and  reliability  goals. 

• The bandwidth requirements for the network are driven by the
number  of  messages  and  associated  protocol.  The  bandwidth  in 
turn  dictates  the  technology  used  in  the  implementation.  Inade¬ 
quate  bandwidth  will  degrade  the  response  time  of  the  system. 

• A fully distributed system carries a substantial amount of overhead,
particularly in the operating system. It is the responsibility
of the operating system to schedule tasks to the computer
resources  and  to  determine  their  health  status.  A  failed  comput¬ 
er  resource  must  be  taken  off  line  and  its  tasks  reallocated  by 
the  operating  system  to  a  healthy  processing  unit.  The  amount 
of  time  allowed  for  reconfiguration  is  driven  by  the  system  re¬ 
quirements,  the  complexity  of  the  operating  system,  and  the 
technology  proposed  for  the  distributed  system. 

• The data base operating system concerns can become complex
when  other  processing  resources  require  a  non-resident  data 
base;  for  example,  in  extracting  data  which  is  not  local  to  the 
processing  resource. 
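The reallocation duty placed on the operating system in the list above can be sketched as follows. This is a minimal illustration, not a description of any particular operating system: the `Node` class, the notion of "load units", and the policy of giving each orphaned task to the healthy node with the most spare capacity are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    capacity: int                 # hypothetical load units the node can carry
    healthy: bool = True
    tasks: dict = field(default_factory=dict)  # task name -> load units

    def load(self) -> int:
        return sum(self.tasks.values())


def reallocate(nodes, failed_name):
    """Take the failed node off line and move its tasks to healthy nodes.
    Returns the names of tasks that could not be placed (lost functions)."""
    failed = next(n for n in nodes if n.name == failed_name)
    failed.healthy = False
    # Place the largest tasks first so they are not crowded out.
    orphaned = sorted(failed.tasks.items(), key=lambda item: -item[1])
    failed.tasks = {}
    unplaced = []
    for task, load in orphaned:
        candidates = [n for n in nodes
                      if n.healthy and n.capacity - n.load() >= load]
        if not candidates:
            unplaced.append(task)
            continue
        target = max(candidates, key=lambda n: n.capacity - n.load())
        target.tasks[task] = load
    return unplaced
```

With enough excess capacity the first failure costs nothing but margin; once the margin is consumed, further failures shed tasks, which is where a priority ordering of functions (not modeled here) would be needed.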

Failure  to  address  the  above  design  issues  in  a  timely  manner  can 
result  in  excessively  long  response  times,  poor  reliability  and  increased 
system costs. When a C3I systems development effort includes a selection
between centralized and distributed processing system approaches, it is
important  that  this  selection  be  made  no  later  than  the  end  of  the  Demon¬ 
stration/Validation  phase.  There  are  significant  complexity,  systems 
integration,  and  development  effort  considerations  associated  with  distri¬ 
buted  processing  systems.  Therefore,  contractor  program  managers 
should  identify  and  schedule  appropriate  trade  studies  and  analyses  to 
support the recommended data processing approach.

Distributed  processing  systems  are  gaining  increased  importance  in 
satisfying C3I system development objectives. They can be designed to
contain  a  wide  range  of  hardware  and  software  fault  tolerance  techniques 
and  thus  satisfy  stringent  long  life,  autonomous  operation  and  availabil¬ 
ity  requirements.  Further  information  including  R/M/T  impacts  of  vari¬ 
ous  distributed  processing  architectures  is  provided  in  the  Fault 
Tolerant  Design  Implementation  Guide. 


4.1.11  HARDWARE  AND  SOFTWARE  FAULT  TOLERANT  DESIGN 
CHECKLIST  QUESTIONS 

The following questions will provide guidance for AF program managers
and contractor designers during the development of system architectures
for fault tolerant C3I systems.

a.  What  system  requirement  has  driven  the  decision  to  incorporate 
redundancy?  (C) 

b.  Has  the  decision  to  incorporate  redundancy  techniques  been 
based  upon  a  tradeoff  analysis? 

c. Have the cost benefits of other reliability improvement techniques
(e.g. parts derating, design simplification, environmental stress
screening, etc.) been considered prior to the decision to duplicate
hardware/software?

d.  What  alternate  redundancy  technique(s)  have  been  identified 
which  satisfy  the  allocated  reliability  requirement?  Do  these  al¬ 
ternates  result  in  lower  system  weight  or  cost? 


e. Has the designer considered the added complexity, cost, and
weight of fault detection, isolation, switching, and other peripheral
devices needed to implement the particular redundancy
configuration? (C)

f. Have the levels of implementation of redundancy been selected
with  testability  considerations  in  mind? 

g.  Has  the  reliability  and  availability  of  the  system  been  accurately 
modeled  at  the  level  of  implementation  of  redundancy? 

h. Has the switch failure rate been incorporated in the reliability
model of standby and hybrid redundancy?

i.  Has  the  voter/comparator  failure  rate  been  incorporated  in  the 
reliability  model  of  voting  redundancy  configurations?  What 
means  have  been  taken  to  prevent  the  voter  from  becoming  a 
single  point  failure? 

j.  Have  the  following  approaches  been  considered  when  pooled 
spares  are  to  be  employed  in  mission  and  safety  critical  applica¬ 
tions?  (C) 

• design soft turn-on circuitry for cold spares
• operate with standby spares
• operate with flexing of spares.

k. Is the distributed processing operating system redundant so as
to  minimize  the  impact  of  a  failed  memory  module? 

l.  Does  the  operating  system  periodically  check  the  health  status  of 
spare  redundant  modules? 
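Questions h and i above ask whether the switch and the voter appear in the reliability model. The following sketch shows why this matters, using deliberately simplified textbook models: a 2-of-3 voting configuration with a single voter in series, and one active unit backed by one standby spare that must be switched in. The function names, the assumption that the spare cannot fail while dormant, and the numerical values are all hypothetical.

```python
def tmr_reliability(r_unit: float, r_voter: float = 1.0) -> float:
    """2-of-3 voting redundancy with one non-redundant voter in series.
    R = r_voter * (3 r^2 - 2 r^3), assuming independent unit failures."""
    return r_voter * (3 * r_unit**2 - 2 * r_unit**3)


def standby_reliability(r_unit: float, r_switch: float = 1.0) -> float:
    """One active unit plus one standby spare (simplified discrete model).
    The system succeeds if the active unit survives, or if it fails and
    the switch works and the spare then survives. The spare is assumed
    not to fail while dormant."""
    return r_unit + (1 - r_unit) * r_switch * r_unit


if __name__ == "__main__":
    r = 0.9  # hypothetical mission reliability of one unit
    for r_aux in (1.0, 0.95, 0.9):
        print(r_aux, tmr_reliability(r, r_aux), standby_reliability(r, r_aux))
```

In this model a voter that is no more reliable than the units it votes on leaves the triplicated configuration less reliable than a single unit, which is the single point failure concern raised in question i.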

4.2 MAINTAINABILITY/TESTABILITY IMPACT ON FAULT TOLERANT DESIGN OPTIONS

4.2.1  Testability  of  Fault  Tolerant  Designs 

The  success  of  most  fault  tolerant  systems  depends  largely  on  the 
design's  inherent  diagnostic  capability  and  testability--specifically  its 
ability  to  detect,  identify,  and  report  malfunctions  so  that  suitable 
corrective  action  can  be  taken.  The  selection  of  a  redundant  design 
technique  must  include  an  assessment  of  associated  diagnostic/testability 
alternatives and their overall impact on the achievement of the design
goal performance requirements. The methods chosen to implement the
diagnostic/testability task depend on what is being tested with the fault
tolerance design option. Once a method is chosen, constraints imposed
upon it begin to reveal themselves from various other interdependent requirements.
Design for pure testability, whether in a fault-tolerant
framework or not, can be realized only after the designer has dealt with
constraints such as cost, available real estate, size and weight limitations,
available power, and interface complexity restrictions.

When performing trades to secure additional real estate or complexity
for diagnostic/testability capability for fault tolerant systems, the designer
has more freedom than the designers of conventional systems.
This is because the added hardware and software for the test function
serve multiple purposes. First, performance-monitoring testing assures
the user that the equipment is working. Second, this testing capability
helps to isolate faults to a replaceable module. Third, in standby
redundant strategies, the built-in self-test or diagnostic function must
detect and identify malfunctions so that the standby or redundant function
can be switched in. This third functional requirement demands that
designers be more responsive to the diagnostic/testability needs. By
integrating  all  three  diagnostic  capabilities  into  a  cohesive  concept,  the 
overall  task  can  be  accomplished  much  more  easily. 

One of the most demanding requirements imposed on the diagnostic/
testability capability of fault tolerant system design is a quick response
time to reconfigure. Systems with no critical reconfiguration response
times  are  free  to  have  self-test  diagnostics  put  into  a  low  priority  back¬ 
ground  mode.  These  systems  need  not  compete  for  processing  time, 
serial  bus  access,  or  slow  electro-mechanical  relay  switching  time,  to 
mention  just  a  few  examples. 

Additional  hardware  and/or  software  may  be  required  to  internally 
test a function such as a self-test diagnostic capability. This addition
may be beyond what is necessary to perform its normal dedicated function.
As a general rule, contractor program managers should establish a goal
that the test circuitry to be added has a failure rate an order of magnitude
better than the functional circuitry to be tested. This goal may be
relaxed if the program manager is satisfied that it is too stringent and
would compromise the ability to satisfy other critical system design requirements.
Air Force program managers should assure themselves that
the ratio of BIT circuitry failure rate to functional circuitry failure rate
is not excessive. This will minimize corrective maintenance events due
to failures of an overly complex BIT diagnostic function. However, there
should be sufficient diagnostic capability built into the design to reliably
carry out these detection, identification and reporting test functions.
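The order-of-magnitude goal reduces to a simple ratio check, sketched below. The function names and the default goal ratio of 0.1 are assumptions used to express "an order of magnitude better"; the failure rates may be in any consistent unit (for example, failures per million hours).

```python
def bit_to_functional_ratio(bit_failure_rate: float,
                            functional_failure_rate: float) -> float:
    """Ratio of BIT circuitry failure rate to functional circuitry failure
    rate. Both rates must be expressed in the same units."""
    if functional_failure_rate <= 0:
        raise ValueError("functional failure rate must be positive")
    return bit_failure_rate / functional_failure_rate


def meets_bit_goal(bit_failure_rate: float, functional_failure_rate: float,
                   goal_ratio: float = 0.1) -> bool:
    """True when the BIT circuitry is at least an order of magnitude more
    reliable than the circuitry it tests (goal_ratio is an assumption)."""
    ratio = bit_to_functional_ratio(bit_failure_rate, functional_failure_rate)
    return ratio <= goal_ratio
```

A design that fails this check is not necessarily unacceptable, since the goal may be relaxed as noted above, but the ratio gives the program manager a number to question.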

An  important  factor  influencing  the  diagnostic/testability  design  is 
new technology. Today, more can be accomplished with a package of the
same  size  as  that  of  10  years  ago.  However,  there  is  a  tendency  to  shy 
away  from  new  technology  because  of  lack  of  confidence  resulting  from 
insufficient  field  testing.  Risks  associated  with  single-source  procure¬ 
ment  have  been  used  as  a  possible  reason  to  reject  a  good  solution. 
Therefore,  although  determining  how  to  test  a  function  may  be  readily 
resolved  by  using  a  new  technique,  alternate  solutions  are  often  sought 
because  of  lack  of  confidence  in  the  new  technique. 

System  reliability  can  be  improved  by  using  redundancy  techniques, 
but caution must be exercised in this approach. Fault detection and isolation
are often the limiting factors when designing redundancy into the
system.  For  example,  a  subsystem  may  consist  of  a  number  of  redun¬ 
dantly  configured  items  and  the  reconfiguration  strategy  may  require 
isolating  a  failed  item  before  an  operationally  redundant  item  can  be 
switched  into  its  place.  Depending  upon  function  criticality,  redundant 
units  can  be  switched  in  either  at  the  first  indication  of  a  failure  or  af¬ 
ter  a  failure  indication  has  been  sustained.  In  either  case,  after  the 
spare  unit  has  been  switched  in,  the  operating  system  can  command  more 
exhaustive BIT on the faulty module and log the unit as failed if confirmed
by BIT, or return the unit to standby or active status if the failure
is not confirmed. When considering adding more redundant items,
caution is required since the diagnostic/testability coverage of failed items is
seldom 100% perfect. When the imperfect probabilities of correct failure
detection and isolation are taken into account, it is entirely possible
that the subsystem probability of mission success may not increase with
the addition of redundant items.
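The effect of imperfect detection and isolation can be made concrete with a small model. The sketch assumes n identical redundant items, each surviving the mission with probability r; every failure is correctly detected and isolated with probability `coverage`, and a single undetected failure is assumed to defeat the subsystem. These assumptions, the function name, and the numbers are illustrative only.

```python
def mission_success(n: int, r: float, coverage: float) -> float:
    """Probability of mission success for n redundant items when only one
    working item is needed, but every failure must also be correctly
    detected and isolated (probability `coverage` per failure)."""
    covered_failure = (1 - r) * coverage
    # All items either survive or fail with coverage, minus the case in
    # which every item has failed.
    return (r + covered_failure) ** n - covered_failure**n


if __name__ == "__main__":
    for n in range(1, 6):
        print(n, mission_success(n, r=0.9, coverage=0.9),
              mission_success(n, r=0.9, coverage=1.0))
```

With perfect coverage each added item helps. With 90% coverage in this model, the second item helps but the third lowers the probability of success, because each added item is also another opportunity for an undetected failure.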

It may be helpful to review some of the more desirable design considerations
for diagnostics/testability before establishing what can be expected
from the diagnostic design of a fault tolerant system. These design
considerations include the following:

a.  Comparison  Method  -  An  effective  method  for  testing  similar  sys¬ 
tems  with  similar  inputs  and  outputs  is  to  compare  outputs  and 
flag any gross disagreements. It is desirable to provide a means
to  determine  which  branch  is  faulted. 

b.  Redundancy  Verification  -  The  built-in  test  should  test  each  re¬ 
dundant  path  individually  whenever  possible,  to  prevent  the 
masking  of  faults  in  redundant  items. 

c.  Flexing  of  Spares  -  Periodically  activate  all  available  assets  when 
continuous  or  concurrent  fault  detection  methods  are  utilized 
within  hot  spares,  so  that  the  built-in  test  of  the  hot  spares  is 
activated  and  reported  out  before  these  items  are  needed  and 
switched  in. 

d.  Voting  Scheme  Technique  -  A  typical  example  of  a  voting  scheme 
technique is to compare output values from three different
sources.  Confidence  is  placed  in  that  value  where  at  least  two 
of  the  three  sources  agree.  The  source  of  the  erroneous  value 
should  be  corrected  at  an  appropriate  maintenance  schedule. 

e.  Error  Correction  -  Detection  of  degraded  performance  in  stages 
preceding an error-correcting function is difficult. This is because
the error-correcting function makes its preceding degraded
stage appear healthy. The error-correcting functions should
keep a count of the number of times corrections had to be made.
When a predetermined threshold count is exceeded, a test signal
may be injected to determine if the input stage is unacceptably
degraded.

f.  Multiple  Redundancy  -  In  highly  redundant  systems  which  are 
allowed  to  gracefully  degrade  through  failures  of  redundant  ele¬ 
ments,  a  test  should  be  established  to  verify  that  minimum  ac¬ 
ceptable  system  performance  levels  are  met  during  system  opera¬ 
tion. 

g.  Echo  Message  -  When  it  is  necessary  to  transmit  long  messages, 
the  ability  to  echo  back  a  message  is  particularly  useful.  This 
feature  provides  confidence  that  the  message  has  been  accurately 
received.  A  time  out  is  usually  set  in  anticipation  of  the  echo 
message. If nothing is received before the time out has elapsed, or if
an erroneous echo is received, the message is sent again and a fault
flag is set.

h.  System  Check  -  Severe  and  damaging  faults  often  render  it  im¬ 
possible  for  a  system  to  check  itself.  One  by-product  of  redun¬ 
dancy  is  the  fact  that  without  much  complication  a  system  that  is 
capable  of  checking  itself  can  also  check  out  another  system  like 
itself.  Therefore,  it  may  be  advantageous  to  have  similar  sys¬ 
tems  periodically  check  each  other. 

i.  Redundant  Bus  -  Provision  for  a  status  word  has  been  included 
successfully  in  1553-type  systems  utilizing  redundant  buses. 
Subsystem  access  to  the  bus  is  completely  controlled  by  a  bus 
controller.  Each  subsystem  is  informed  by  the  bus  controller 
when  to  send  and  when  to  receive  a  message.  Every  time  a 
subsystem receives such information from the bus controller, the
subsystem sends a status word back to the bus controller. This
status word usually contains a number of bits reflecting the
health  of  the  subsystem,  the  actual  word-count  received,  the 
comparison  results  of  the  expected  word-count,  the  word-count 
it  is  presently  sending,  etc.  If  the  bus  monitor  detects  an  er¬ 
ror  within  the  bus  system,  it  automatically  switches  over  to  the 
redundant  bus  and  reports  this  out  upon  demand.  Maintenance 
personnel  can  isolate  a  fault  quickly  by  observing  failure  indica¬ 
tions  from  the  bus  monitor  as  well  as  from  the  various  subsys¬ 
tems. 

j.  Non-Volatile  RAM  -  A  microprocessor's  ability  to  access  a  non¬ 
volatile  RAM  serves  a  dual  purpose.  First,  it  can  log  fault- 
detection  information  that  may  be  retrieved  by  maintenance 
personnel  after  power  has  been  shut  off.  Secondly,  it  can  log 
software errors detected and trapped during on-line programming.
A third possible service worth noting is the use of
non-volatile RAMs to periodically checkpoint certain computed values.
Power transient induced faults would then become tolerable
because the processor would only have to "roll back" to the value
stored at the checkpoint rather than begin the entire computation
all over again.

k.  Intermittent  Faults  -  One  way  to  identify  intermittent  faults  is  to 
log  every  detected  occurrence  into  memory  (possibly  non-volatile 
memory).  Once  the  trend  of  an  intermittent  fault  is  determined, 
effective  corrective  action  can  be  taken. 

l.  Signal  Elements  -  It  is  often  imperative  that  C3I  signals  be  sent 
in  hostile  and  jamming  environments.  Receivers  can  accurately 
interpret a signal even if V8 of its total initial format is lost.
Although these receivers work extremely well, higher levels of
fault detection coverage would be difficult to achieve with conventional
overall wraparound tests or even quick operational
checks. At close range, these systems perform perfectly without
antennas or even without their power amplifiers. Elegant, localized
sensitivity tests, therefore, can be built into the equipment.
If the equipment is unacceptably degraded, the demodulation
elements must present their own fault flag outputs.

m. Caution Indications - Fault tolerance can be applied to a variety
of  system  types,  i.e.,  electrical,  mechanical,  hydraulic,  environ¬ 
mental,  etc.  Regardless  of  the  system  type,  it  is  customary  to 
include  a  caution  indication  whenever  a  backup  system  is  called 
into  service,  especially  when  a  failure  within  the  backup  system 
could  be  hazardous  to  those  involved. 
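Two of the design considerations listed above, the voting scheme (item d) and the correction counter for error-correcting functions (item e), lend themselves to a short sketch. The function and class names are hypothetical, and the voter compares values for exact equality; a voter for analog quantities would need an agreement tolerance, which is not modeled here.

```python
from collections import Counter


def vote_2_of_3(a, b, c):
    """Return (voted value, index of the disagreeing source).
    The value is None when no two sources agree; the index is None
    when all three agree or when no majority exists."""
    values = (a, b, c)
    value, count = Counter(values).most_common(1)[0]
    if count < 2:
        return None, None
    faulty = next((i for i, v in enumerate(values) if v != value), None)
    return value, faulty


class CorrectionMonitor:
    """Counts corrections made by an error-correcting function so that a
    degraded upstream stage is not hidden indefinitely."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self.count = 0

    def record_correction(self) -> bool:
        """Record one correction. Returns True once the threshold has been
        exceeded, i.e. when a test signal should be injected."""
        self.count += 1
        return self.count > self.threshold
```

The index returned by the voter identifies the source to be logged for correction at the next appropriate maintenance opportunity, while the monitor supplies the trigger for the injected test signal described in item e.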

4.2.2  Maintainability  of  Fault  Tolerant  Designs 

The  ability  to  meet  fault  tolerant  requirements  imposed  upon  a  sys¬ 
tem  is  greatly  influenced  by  its  capability  to  detect,  isolate,  and  repair 
malfunctions  as  they  occur  or  are  anticipated  to  occur.  This  mandates 
that  alternate  maintainability  and  diagnostic  concepts  be  carefully  studied 
and  reviewed  before  committing  to  a  final  design  approach.  A  mainte¬ 
nance  plan,  based  upon  the  system's  maintainability  features  and  diag¬ 
nostic  capabilities,  must  then  be  developed  so  that  it  optimizes  logistics 
resource  requirements.  The  repair  scenario  should  be  viewed  from  as 
global  a  position  as  possible  to  accurately  determine  the  true,  bottom-line 
cost.  The  unscheduled  Organizational  (O)  level  maintenance,  although 
a major part, still is only a portion of the total overall maintenance activity.
Other maintenance activities include scheduled/preventive,
O-level inspection and service, Intermediate (I)-level maintenance, and
Depot (D)-level maintenance. The cost of each level contributes to the
LCC, which should be the driving measure for any decision a maintenance
planner  makes. 

Probably  the  most  important  steps  a  maintainability  engineer  must 
take  are  defining  effective  Maintainability  and  Diagnostic  concepts  that 
are capable of meeting the mission performance requirements while minimizing
LCC. Usually, there is a handful of options available, but before
one can intelligently choose the correct approach, some basic and
typical questions should be answered:

• What are the overall mission reliability requirements?
• Do these requirements demand multiple redundancies and/or sophisticated
techniques to enhance reliability?
• What is the system's allowable loss probability per operating hour
requirement?
• What are the system performance monitoring requirements?
• What are the required maximum and mean times to repair?
• What are the risk areas that demand attention?
• Will on-line or in-flight maintenance be required or even be possible?
• What is the Fraction of Faults Isolatable (FFI) design goal?
• What percentage of the maintenance diagnostics can be achieved
by the embedded diagnostics provided to meet safety and functional
performance requirements?
• Can BIT eliminate and/or complement ATE requirements?
• Can the intermediate level of maintenance be minimized or eliminated?
• Can the equipment design be functionally partitioned to facilitate
a module-level maintenance concept?
• Can reliance on support equipment be eliminated?
• Can the ability to record maintenance history (in-flight and
on-ground) be provided within the onboard diagnostic system
design?

The appropriate answers to these and other pertinent questions
will  help  formulate  the  maintainability  and  diagnostic  concepts  necessary 
for  the  system.  In  addition,  by  reviewing  previous  history  and  the 
available  and  allowable  resources  (such  as  man-hours,  personnel  skill 
levels,  GSE  requirement  and  system  availability  requirements),  better 
judgments can be made on logistics decisions such as:

• How often should a corrective maintenance action be expected?

• Should the designer plan for scheduled maintenance, and if so,
how often?

• How many and what types of spares should be stocked?

• Where should the spares be stocked?

• Should an instructive computer program be developed to aid the
technicians involved in maintenance and fault isolation activities?

How a system is to be maintained should be analyzed in parallel
with how it should be designed to meet its reliability, availability (heavily
influenced by maintainability) and survivability requirements. The
maintenance concept should be considered early in the design phase of
the program, since there is a better chance to develop a cost-effective
and efficient system if maintainability is an initial design concern.

Providing an efficient, cost-effective means of maintaining a C3I
system, without hindering mission performance (or affecting mission requirements),
requires that a design vs. corrective maintenance trade-off
analysis be conducted early in the development process. For example,
to  achieve  a  mission  reliability  goal  with  a  K  out  of  N  redundant  system 
(see  para.  4.1.6),  more  frequent  restoration  of  redundant  elements 
would  result  in  a  lower  number  of  required  total  redundant  elements  (but 
in  higher  maintenance  hours).  Conversely,  if  operational  considerations 
dictate  an  extended  time  period  between  redundancy  restoration,  then  a 
larger  number  of  redundant  elements  would  be  required  to  satisfy  the 
mission  reliability  goal.  Details  of  design  vs.  corrective  maintenance 
trades are illustrated in the Fault Tolerant Design Implementation Guide.
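The direction of this trade can be shown with a rough calculation. The sketch below assumes identical elements with a constant failure rate (exponential failure law), full restoration of redundancy at fixed intervals, and independent failures; the function name and all numerical values are hypothetical and are not drawn from the Implementation Guide.

```python
from math import comb, exp


def elements_required(k: int, failure_rate: float, interval_hours: float,
                      goal: float, max_n: int = 200) -> int:
    """Smallest N such that at least k of N elements are still working at
    the end of one restoration interval with probability >= goal."""
    r = exp(-failure_rate * interval_hours)  # element survival probability
    for n in range(k, max_n + 1):
        p = sum(comb(n, i) * r**i * (1 - r) ** (n - i)
                for i in range(k, n + 1))
        if p >= goal:
            return n
    raise ValueError("goal not reachable within max_n elements")


if __name__ == "__main__":
    # Hypothetical case: 8 working elements needed, 1e-4 failures per hour.
    for interval in (100, 500, 1000):
        print(interval, elements_required(8, 1e-4, interval, 0.999))
```

Shorter restoration intervals reduce the hardware count at the price of more frequent maintenance actions; the maintenance man-hours side of the trade is not modeled here.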

Before a decision is reached on selecting a particular redundancy
scheme, contractor program managers should ensure that satisfactory
responses are obtained for the following typical maintenance related
questions:

• What methods will be used to fault detect (FD) and fault isolate
(FI)? How effective will the FD/FI tests be? What faults cannot
be detected and/or isolated using the FD/FI tests?

• What is the risk that an unscheduled corrective maintenance
action will adversely affect the mission? Is it tolerable?

• How many manhours would be necessary to perform anticipated
unscheduled maintenance actions?

• What would be the mean-time-to-repair (MTTR) for such a system?

• Does this MTTR meet the system performance requirements?

• Can the system provide full service during an unscheduled
maintenance activity (consideration must be given to power supplies,
maintenance technician and tool access, possible shorting of
adjacent channels, etc.)?

• How many spares must be stocked and at how many locations?

• How long does it take to replenish the spares inventory?

Table 4-3 presents attributes of some of the options available for
maintaining fault tolerant C3I systems requiring high readiness levels.

4.2.3  MAINTAINABILITY  AND  TESTABILITY  CHECKLIST  QUESTIONS 
Maintainability 

a. Will the maintainability concepts be developed in parallel with
other concepts proposed for achieving reliability, availability and
survivability requirements?

b. Have the costs of all the required maintenance levels been
considered before presenting a maintenance concept? (C)

c. What maintenance concept options will best provide an efficient,
cost-effective means to maintain a C3I system without hindering
mission performance?

Testability/Diagnostics

a. What resources will be required for Testability/Diagnostics to
meet the fault tolerance design goals?

b. Can the BIT and BITE design (used to detect and isolate faults
for performance monitoring and maintenance) be used to achieve
the desired fault tolerance performance levels?

c. What additional constraints are imposed upon the
testability/diagnostics design? (C)




TABLE 4-3. Maintenance Concept Options.

ONLINE
   Description: Design allows rapid restoration of the system by
      replacement of BIT/FIT identified LRUs and SRUs with spares.
   Typical applications: High criticality strategic system functions,
      i.e., data processing, communication links, etc. Also in-flight
      on-equipment maintenance with on-board spares.
   Advantages: System continues full operation or with minor
      interruption in service.
   Disadvantages: Added complexity of FD/FI and switching. Added cost
      of duplicated equipment and on-line spares.

DEFERRED
   Description: Design allows scheduling non-critical maintenance at a
      more convenient time or place.
   Typical applications: Acceptable degraded modes of operation and
      other gracefully degrading systems. Non-critical equipment
      failures.
   Advantages: System continues operating. More efficient use of
      maintenance manpower and schedule.
   Disadvantages: Full performance capability may not be available, if
      needed.

OPPORTUNISTIC
   Description: Design allows continued operation with a degraded
      system until the required mix of spares, ATE, personnel and
      schedule is available to perform the deferred maintenance.
   Typical applications: Acceptable degraded modes of operation and
      other gracefully degrading systems. Non-critical equipment
      failures.
   Advantages: System maintains high readiness. More efficient use of
      maintenance manpower and schedule.
   Disadvantages: Full performance capability may not be available, if
      needed.

PREPOSITIONED
   Description: Comprehensive maintenance is limited to specific
      sites. The system can be diverted or transported from its
      operational site to a particular maintenance site to perform a
      particular level of maintenance.
   Typical applications: Airborne C3I systems and transportable
      subsystems.
   Advantages: Reduced maintenance manpower, skill levels and support
      equipment required.
   Disadvantages: May result in degraded readiness.

RAPID DEPLOYMENT
   Description: Design permits system operation for a specific time
      period with minimum logistics and support resources.
   Typical applications: Ground mobile and airborne C3I systems with
      self-contained electrical generators, auxiliary power units, jet
      fuel starters, etc.
   Advantages: Enhanced tactical/surge capability during hostile
      actions.
   Disadvantages: Added system complexity.

AUSTERE SITE
   Description: Design permits system operation for extended time
      periods at unimproved facilities with minimal logistics
      resources.
   Typical applications: Ground and airborne C3I systems with
      self-contained electrical generators, auxiliary power units, jet
      fuel starters, etc.
   Advantages: Enhanced system survivability during hostile actions.
   Disadvantages: Added initial system cost.

SELF CONTAINED
   Description: A system containing sufficient fault tolerant design
      provisions that requires little or no external maintenance to
      complete a mission.
   Typical applications: High criticality strategic national security
      systems.
   Advantages: High readiness.
   Disadvantages: Added complexity, weight, power and initial cost.



d. What recent technological advances can be used to solve
testability/diagnostics design problems? (C)

e. Can the ratio of the functional circuitry failure rate to BIT
circuitry failure rate be kept above 10-to-1 as a general rule and
still cover all diagnostics requirements including redundancy
management? If this is not possible, is there a good reason for
exceeding this goal? (C)

f. What are the time constraints for BIT performance in the
operational time line?




5  -  FAULT  TOLERANCE  DESIGN  AND  TRADEOFF  ANALYSES 


This  section  provides  the  necessary  management  background  infor¬ 
mation  to  formulate  fault  tolerant  designs  and  conduct  tradeoff  analyses. 
The  approach  described  herein  promotes  the  development  of  balanced  de¬ 
signs  with  R/M/T  attributes  to  enhance  supportability  and  mission  effec¬ 
tiveness  at  minimal  life  cycle  cost. 

5.1  FAULT  TOLERANCE  DESIGN  METHODOLOGY 

Fault  tolerance  must  be  incorporated  into  the  design  as  part  of  the 
system  engineering  process.  Experience  has  shown  that  a  hierarchical 
approach,  involving  the  selective  application  of  fault  tolerant  design 
techniques,  is  most  effective.  Figure  5-1  shows  the  recommended  fault 
tolerant  design  methodology.  This  approach  consists  of  first  creating  a 
baseline  design,  and  then  systematically  introducing  fault  tolerance  to 
meet  the  R/M/T  requirements.  The  process  is  iterative  and  assures  that 
all  system  requirements  can  be  achieved  within  program  cost  and  sched¬ 
ule  constraints. 

5.1.1  Baseline  Design 

The first step in the fault tolerant design process is to develop a
baseline system architecture for the implementation technology that
meets the system performance requirements. This first cut architecture
should be non-redundant, i.e., contain only the minimum hardware
complement needed to meet the performance parameters. Furthermore,
technology used in the baseline design must represent a reasonable and
attainable development risk that is consistent with the program cost
and schedule constraints. The use of high risk technology that is
incompatible with program cost and schedule will inevitably result in
serious R/M/T and system performance deficiencies.


Figure 5-1. Fault Tolerance Design Methodology. (Flow diagram; block
labels recoverable from the original: reconfiguration strategies,
static redundancy, dynamic redundancy, FMECA; continuous BIT, system
interrupted BIT, redundancy testing; recovery to original performance,
degraded performance, safe shutdown; analytic models, simulations,
experiments; "meet requirements" decision points; detail design
phase.)


5.1.2 Fault Avoidance Techniques

While the baseline design is being developed, applicable fault
avoidance techniques should be identified and carefully evaluated.
These techniques normally represent the most cost-effective method of
increasing system reliability. Typically they include the following
approaches:

• Reduction of environmental stresses, e.g., providing increased
cooling and/or vibration isolation. For operating temperatures
between 10°C and 50°C, a 10 to 15 percent increase in reliability
can be expected for each 10°C decrease in temperature

• Use of military grade piece parts instead of commercial grade

• Application of a more stringent part derating policy for new
designs

• Imposition of environmental stress screening at the piece part
and equipment levels.

5.1.3  Development  of  the  Fault  Tolerant  Design  Approach 

Section  3  of  this  Guide  describes  the  methodology  used  to  establish 
the  system  fault  tolerance  requirements  based  on  mission  and  safety  crit¬ 
icalities.  In  addition  to  these  contractor-developed  requirements,  the 
Air  Force  imposes  R/M/T  and  availability  requirements  which  significantly 
influence  the  selection  of  fault  tolerant  configurations.  Fault  tolerance 
requirements  are  allocated  by  contractor  personnel  to  each  hardware  ele¬ 
ment  in  the  system.  This  assures  that  fault  tolerant  design  emphasis  is 
directed  at  the  critical  areas  and  not  indiscriminately  across  the  entire 
system.  Compliant  design  approaches  can  then  be  formulated  using  the 
various  fault  tolerant  options  discussed  in  Section  4  of  this  Guide. 

Typically, three to four designs are initially configured and
qualitatively evaluated against the major system drivers, i.e.,
performance, cost, weight, supportability, etc. Normally, the two most
promising candidate approaches are selected for further configuration
definition and tradeoff analysis. Alternate testability/diagnostic
concepts must be concurrently developed and included as part of the
design tradeoff process. System level FMECAs should be conducted on
each alternate candidate configuration to identify single point
failures and other potential design weaknesses impacting safety and
reliability. Paragraph 5.2 describes the analysis methods that are
commonly used to evaluate design alternatives and select the most
desirable design approach prior to the Preliminary Design Review
(PDR). Design trades should continue long after the PDR and focus on
the detail design issues.

5.1.4  Fault  Detection  Implementation 

After establishing the system diagnostic approach, appropriate fault
detection techniques must be defined to detect all relevant fault types
in a timely manner. Fault detection algorithms are implemented via
various hardware, software, and repetition (time) methods to generate
the initial fault signal. Fault detection algorithms are classified in
accordance with the time of their application as follows:

• Continuous (Background) or Non-Interference Testing - Simultaneous
with normal system operation

• System Interrupted Testing - After normal operation has been
temporarily interrupted

• Redundancy Testing - Either concurrently or at scheduled
intervals; verifies that the various forms of protective redundancy
are themselves fault-free

• Validation Testing - Identifies system imperfections introduced
during the manufacturing and programming processes prior to
system deployment.

5.1.5  Recovery  Implementation 

After a fault is detected, recovery algorithms are used to
reconfigure the system to an alternate mode of operation or safely shut
the system down. Examples of reconfiguration include deactivating a
failed processor and switching in a standby spare processor, and
deactivating a faulty memory area and reallocating the remaining
available storage area of the memory. The anticipated extent of
hardware damage as a result of a fault and the time required to resume
system operation have a major influence on the choice of recovery
techniques that can be used. Fault signal-invoked recovery algorithms
are classified according to the state of the system after recovery as
follows:

• Recovery to original performance
• Recovery to degraded modes of system operation
• Execution of a safe shutdown.

5.1.6  R/M/T  Evaluation  Techniques 

The design activity must be supported by a continual assessment of
the system's ability to meet the R/M/T and availability requirements.
Current and future C3I systems will utilize extensive redundancy, as
well as complex fault detection and recovery management techniques.
The trend towards these ultra-reliable fault tolerant systems has
necessitated the development of sophisticated R/M/T evaluation tools.

In the past, the lack of redundancy was felt to be the major source
of system unreliability and imperfect fault protection coverage was
deemed to have only a second-order effect. With the increased emphasis
on fault tolerance for present day C3I systems, redundancy and fault
protection coverage have achieved at least parity, if not complete
role-reversal. System faults which occur may or may not be detected,
and faults which are detected may or may not result in correct
isolation and reconfiguration. Thus, to be of value, analytical
reliability and availability models must properly account for the
adverse effects of imperfect fault protection coverage. System
reliability and availability figures of merit must be determined by
evaluating the inherent fault protection coverage and the ability to
reconfigure to alternate modes of acceptable system operation.

To date, most of the reliability models used to evaluate complex
fault tolerant systems are based on Markov methods. Some of the more
popular models are ARIES, CARE III, HARP and SURE, and their important
characteristics are summarized in Table 5-1. The reader should refer
to the Fault Tolerant Design Implementation Guide for more information
on this subject.
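
The effect of imperfect coverage can be shown with a small sketch. It
is not one of the tools named above; it is the closed-form solution of
a three-state homogeneous Markov model for two active redundant units,
under the assumptions of a constant failure rate, no repair, and a
single constant coverage factor. The function name and the numeric
values are hypothetical.

```python
from math import exp


def duplex_reliability(failure_rate: float, hours: float, coverage: float) -> float:
    """Reliability of two active redundant units with imperfect coverage.

    Assumed three-state Markov model (no repair):
      both good -> one good  at rate 2 * failure_rate * coverage
      both good -> failed    at rate 2 * failure_rate * (1 - coverage)
      one good  -> failed    at rate failure_rate
    """
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must lie between 0 and 1")
    p_both_good = exp(-2.0 * failure_rate * hours)
    p_one_good = 2.0 * coverage * (exp(-failure_rate * hours)
                                   - exp(-2.0 * failure_rate * hours))
    return p_both_good + p_one_good


# Hypothetical inputs: unit failure rate 1e-4 per hour, 1000 hour mission.
LAMBDA = 1.0e-4
MISSION_HR = 1000.0

print(f"single unit    : {exp(-LAMBDA * MISSION_HR):.6f}")
for c in (1.0, 0.99, 0.95, 0.90, 0.0):
    r = duplex_reliability(LAMBDA, MISSION_HR, c)
    print(f"coverage {c:4.2f}  : {r:.6f}")
```

With perfect coverage the expression reduces to the familiar parallel
result; with zero coverage the pair behaves like a series system and is
less reliable than a single unit, which is why coverage cannot be
treated as a second-order effect in such models.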




TABLE 5-1. Characteristics of the ARIES, CARE III, HARP and SURE
Reliability Models.

SIZE
   ARIES:    Large systems
   CARE III: Large systems
   HARP:     Small systems
   SURE:     Small systems

MATURITY
   ARIES:    Mature
   CARE III: Mature
   HARP:     Relatively new
   SURE:     Relatively new

RESULT
   ARIES:    Lower bound
   CARE III: Lower bound
   HARP:     Lower bound
   SURE:     Upper and lower bounds

SOLUTION
   ARIES:    Eigenvalue
   CARE III: Numerical integration
   HARP:     Numerical integration
   SURE:     New algebraic theory

MODEL TYPE
   ARIES:    Homogeneous Markov
   CARE III: Semi-Markov
   HARP:     Non-homogeneous Markov
   SURE:     Semi-Markov

INPUT
   ARIES:    System state
   CARE III: Fault tree
   HARP:     Fault tree or system state
   SURE:     System state

REPAIRABLE SYSTEM
   ARIES:    Yes
   CARE III: No
   HARP:     Yes
   SURE:     No

FAILURE RATES
   ARIES:    Exponential
   CARE III: Exponential or Weibull
   HARP:     Weibull for non-repairable systems
   SURE:     Exponential

FAULT HANDLING
   ARIES:    Instantaneous; constant fault protection coverage
             factors are input
   CARE III: Includes a separate model which is independent of
             system state
   HARP:     Choice of 7 models, one of which is a simulation
   SURE:     No detailed parametric model. Input sample means and
             variances of recovery.

SPARES
   ARIES:    Spares have own failure rates
   CARE III: Hot with same failure rate as active units
   HARP:     Spares have own failure rates
   SURE:     When cold, failure rate is zero. When hot, same failure
             rate as active units.

5.1.7  FAULT  TOLERANT  DESIGN  METHODOLOGY  CHECKLIST 
QUESTIONS 

The following questions are intended to assist program managers in
achieving the appropriate level of fault tolerance through the design
and tradeoff analysis process:

a. Does the systems design approach include fault tolerance as an
integral part of the systems engineering process?

b. Does the system design approach clearly reflect the R/M/T and
fault tolerance requirements and the methods for their evaluation
and optimization?

c. What is the overall diagnostic strategy?

d. What is the system level fault protection design concept?




e. What analyses/tradeoffs have been accomplished, or are planned,
to assure a system architecture that minimizes the effect of
faults?

f. What are the fault containment strategies? How is data
integrity, including data bases, protected from damage caused by
faults?

g. What is the software overhead penalty for implementing fault
tolerance? What is the hardware penalty for implementing fault
tolerance techniques?

h. How are the fault tolerant system interfaces protected?

i. Has a list been developed of all the critical technology
developments needed to support fault tolerance? What are the states
of development of these technologies?

j. How credible is the reliability/availability model and supporting
input data?

5.2  R/M/T  DESIGN  TRADEOFF  ANALYSES 
5.2.1  Readiness  Analysis 

Readiness  tradeoff  analyses  are  used  to  evaluate  the  impact  of 
R/M/T  design  features  in  conjunction  with  the  operational  and  mission 
requirements  of  the  system.  Readiness  is  defined  as  the  probability  that 
a  system  is  either  operational  or  ready  to  be  placed  into  operation  at  any 
point  in  time.  The  major  factors  that  influence  readiness  are: 

• Reliability and maintainability design characteristics
• Field maintenance concept employed
• Logistic resources available
• Mission and operational requirements.

These  factors  and  relationships  affecting  readiness  are  shown  in  Fig. 
5-2. 

The readiness of a weapon system is primarily dependent upon the
"repairability" characteristics of the design, i.e., the ability to
accomplish corrective and preventive maintenance within a prescribed
period of "downtime." This is expressed by the availability
(readiness) ratio:

$$\text{Availability (readiness)} = \frac{\text{Uptime}}{\text{Uptime} + \text{Downtime}}$$

The terms operational availability (Ao) and readiness are essentially
interchangeable. Uptime is a function of the maintenance interval or
mean-time-between-maintenance (MTBM). Downtime is determined by the
mean restore time (MRT) to return the system to operational status.
Therefore, operational availability is expressed in the following
equation:

$$A_o = \frac{\text{MTBM}}{\text{MTBM} + \text{MRT}}$$

The maintenance interval comprises all the maintenance actions
associated with functional failures, scheduled maintenance,
inspections, cannibalizations and false alarms. The MRT includes the
actual mean-time-to-repair (MTTR) of a system coupled with the elapsed
time associated with logistic supply and manpower delays. MTBM and
MTTR are determined by the system's design features. Supply and
awaiting maintenance delays are caused by logistic resource
deficiencies which can be minimized with effective management and
planning. Therefore, system readiness, directly attributable to design
characteristics, is normally evaluated in the design phase using the
classical steady-state inherent availability (Ai) relationship:

$$A_i = \frac{\text{MTBM}}{\text{MTBM} + \text{MTTR}}$$
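
The two availability relationships can be compared with a short
sketch. The function names and the numeric values are hypothetical,
and the sketch assumes, as a simplification, that the mean restore
time is the sum of the MTTR and a single mean logistic and manpower
delay.

```python
def inherent_availability(mtbm_hr: float, mttr_hr: float) -> float:
    """Ai = MTBM / (MTBM + MTTR): design-driven availability, no logistic delays."""
    if mtbm_hr <= 0 or mttr_hr < 0:
        raise ValueError("MTBM must be positive and MTTR non-negative")
    return mtbm_hr / (mtbm_hr + mttr_hr)


def operational_availability(mtbm_hr: float, mrt_hr: float) -> float:
    """Ao = MTBM / (MTBM + MRT): MRT includes repair time plus delays."""
    if mtbm_hr <= 0 or mrt_hr < 0:
        raise ValueError("MTBM must be positive and MRT non-negative")
    return mtbm_hr / (mtbm_hr + mrt_hr)


# Hypothetical values, all in hours.
MTBM = 200.0
MTTR = 2.0
LOGISTIC_DELAY = 18.0          # assumed mean supply and manpower delay
MRT = MTTR + LOGISTIC_DELAY    # simplifying assumption: delays add to MTTR

print(f"Ai = {inherent_availability(MTBM, MTTR):.4f}")
print(f"Ao = {operational_availability(MTBM, MRT):.4f}")
```

The gap between the two results is attributable to logistic delays
rather than to the design, which is why the inherent figure is the one
normally used for design-phase trades.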

The reliability, maintainability and testability attributes are
evaluated through design tradeoffs which achieve a balance of
supportability features with the operational and mission needs, and
program resources. As the maintenance interval is increased through
improved reliability, the inherent availability of a system will
approach 100%. Similarly, maintainability design improvements can
reduce the number of false alarms and expedite maintenance. This
improves the availability of the system by increasing the interval
between maintenance (MTBM) and reducing the MTTR. The availability
ratio, MTTR/MTBM, is used extensively in design tradeoffs to assess
the reliability, maintainability and testability impact on system
availability, as illustrated in Fig. 5-3. As this ratio decreases,
either through an increase in the maintenance interval or reduction in
the restore time, the system availability/readiness improves.
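
The dependence on the ratio can be made explicit by dividing the
numerator and denominator of the inherent availability relationship by
MTBM. The symbol \rho for the ratio MTTR/MTBM is introduced here for
illustration only and does not appear in the original.

```latex
A_i = \frac{\text{MTBM}}{\text{MTBM} + \text{MTTR}}
    = \frac{1}{1 + \rho},
\qquad \rho = \frac{\text{MTTR}}{\text{MTBM}}
```

For example, \rho = 0.1 gives an inherent availability of about 0.909,
while \rho = 0.01 gives about 0.990, so availability approaches 100% as
the ratio approaches zero.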


Figure 5-3. Relationship of Availability and its Drivers, MTTR and
MTBM.


Utilization is an important consideration for systems that are
subjected to long periods of inactivity and brief actual operating
times. With fixed logistic assets available, an increase in system
utilization taxes the maintenance resources and decreases system
readiness.

5.2.2  Logistics  Resource  Analysis 

Effective management of logistic resources (personnel, facilities,
equipment and spares) is essential to achieve a high state of system
readiness. Program managers must allocate sufficient funds to establish
logistic requirements and purchase the necessary logistic resources.
Even highly reliable systems, when they fail, can suffer rapid
degradation in system readiness. To avoid deterioration of a system's
readiness capability, logistic planners must provide properly trained
maintenance personnel, facilities and test equipment to accomplish
maintenance, and sufficient quantities of replacement parts and
materials. Figure 5-4 shows how a system's operational availability
(Ao) will decay with increasing restore time due to delays in logistic
supply and maintenance personnel.




Maintenance  manpower  requirements  are  based  on  the  number  and 
types  of  skills  required  to  perform  the  repair  and  scheduled  maintenance 
tasks  at  the  anticipated  maintenance  frequencies.  The  spares  require¬ 
ments  for  a  program  are  normally  determined  by  performing  a  level  of 
repair  analysis  as  part  of  the  Logistic  Support  Analysis  (LSA)  activity. 
This  analysis  establishes  the  most  economic  level  of  repair  (assembly, 
subassembly,  component)  and  identifies  where  the  repair  should  be  ac¬ 
complished  (organization,  intermediate,  depot)  based  on  the  maintenance 
concept.  Logistic  downtime  is  highly  dependent  upon  the  level  of  re¬ 
pair,  the  repair  facility  and  the  number  of  spares  available. 

5.2.3  Mission  Effectiveness  Analysis 

Mission effectiveness, E(t), is a measure of a system's capability to
accomplish its mission objectives within the stated operational demand
time. E(t) is expressed as the product of the operational availability
(Ao), mission reliability (R(t)) and the system performance index (Ps)
as follows:

$$E(t) = A_o \, R(t) \, P_s$$

This expression takes into account the probability that the system will
be available on operational demand (Ao), the probability of not
experiencing a critical system failure (R(t)), and the percentage of
mission objectives that can be expected to be accomplished (Ps). For a
C3I system, the system performance index would relate the mission
objectives to system capabilities such as area of surveillance, target
detection probability, etc. The availability, reliability and
performance parameters are defined in terms of the normal and degraded
modes of system operation. Configuration trades affecting reliability,
supportability and readiness can then be evaluated as a function of
mission effectiveness.
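
A minimal numerical sketch of the mission effectiveness product
follows. The function names and input values are hypothetical, and the
exponential form used for mission reliability, based on a
mean-time-between-critical-failure (MTBCF), is an assumption made for
illustration; the text does not prescribe a particular reliability
model.

```python
from math import exp


def mission_reliability(mission_hr: float, mtbcf_hr: float) -> float:
    """Assumed exponential model: R(t) = exp(-t / MTBCF)."""
    if mtbcf_hr <= 0 or mission_hr < 0:
        raise ValueError("MTBCF must be positive and mission time non-negative")
    return exp(-mission_hr / mtbcf_hr)


def mission_effectiveness(ao: float, r_t: float, ps: float) -> float:
    """E(t) = Ao * R(t) * Ps, with every factor expressed as a fraction of 1."""
    for name, value in (("Ao", ao), ("R(t)", r_t), ("Ps", ps)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie between 0 and 1")
    return ao * r_t * ps


# Hypothetical inputs for one configuration.
AO = 0.90            # operational availability
MTBCF_HR = 500.0     # mean time between critical failures
MISSION_HR = 10.0    # mission duration
PS = 0.95            # system performance index

r = mission_reliability(MISSION_HR, MTBCF_HR)
print(f"R(t) = {r:.4f}")
print(f"E(t) = {mission_effectiveness(AO, r, PS):.4f}")
```

Evaluating the product for each candidate configuration, in both its
normal and degraded modes, gives a single figure against which
configuration trades can be compared.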

5.2.4  Life-Cycle  Cost  (LCC)  Analysis 

Tradeoffs are not meaningful unless they can be expressed in terms
of a common parameter. In terms of readiness, cost is the best common
denominator for normalizing the effects of all the diverse variables
associated with R/M/T and logistics. As an example, the maintainability
characteristics of a C3I system can be quantified in terms of
acquisition and operating and support (O&S) costs by identifying the
impacts on prime equipment, personnel, support equipment, spares,
publications, etc. However, the cost impact of having a multi-million
dollar system unavailable for a mission is difficult to measure. It is
therefore convenient to evaluate a system in terms of weapon system
ready-hours. By addressing the problem on a total-force-level basis,
it can be shown that a few weapon systems with a high readiness rate
can be as effective as a larger number of weapon systems with a lower
readiness rate. A break-even point will define when it is more
cost-effective to procure additional weapon systems, rather than
incorporate additional readiness improvements.

It is sometimes advantageous to work with a worth value rather
than cost directly. Costs can be converted to the worth of a
ready-hour by dividing the anticipated LCC of the weapon system by the
number of ready-hours (requirement or goal) during the life-cycle of
the system. The relationship is:

$$R_I = \frac{\text{LCC per System}}{R \times SL \times 365\ \text{Days/Yr.} \times 24\ \text{Hrs./Day}} = \frac{\text{Total Dollars}}{\text{Ready-Hours}}$$

where:

R_I = Readiness Index = Worth of a Ready-Hour
R = Readiness Rate
and SL = Service Life (Yrs.)

Using the readiness criteria, any improvement to the system can be
evaluated on a cost effectiveness basis. As an example, for a system
with a readiness goal of 80%, a service life of 20 years, and an
anticipated LCC of $75,000,000 per weapon system, a readiness index of
$535/Ready-Hour is obtained. For this particular system, if the cost of
saving one ready-hour over its service life exceeds $535, the
improvement should not be implemented.
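
The worked example above can be reproduced with a few lines of code.
The function names are hypothetical; the inputs are the figures quoted
in the text, and the screening function simply restates the decision
rule given there.

```python
HOURS_PER_YEAR = 365 * 24


def readiness_index(lcc_dollars: float, readiness_rate: float,
                    service_life_years: float) -> float:
    """Worth of a ready-hour: LCC divided by ready-hours over the service life."""
    ready_hours = readiness_rate * service_life_years * HOURS_PER_YEAR
    if ready_hours <= 0:
        raise ValueError("readiness rate and service life must be positive")
    return lcc_dollars / ready_hours


def improvement_is_cost_effective(improvement_cost: float,
                                  ready_hours_gained: float,
                                  index: float) -> bool:
    """True when the cost per ready-hour gained does not exceed the index."""
    return improvement_cost / ready_hours_gained <= index


# Figures from the example in the text.
ri = readiness_index(75_000_000, 0.80, 20)
print(f"Readiness index: ${ri:,.0f} per ready-hour")  # prints "Readiness index: $535 per ready-hour"

# Hypothetical improvement: $50,000 to gain 100 ready-hours ($500 each).
print(improvement_is_cost_effective(50_000, 100, ri))
```

The denominator is 0.80 x 20 x 8,760 = 140,160 ready-hours, so the
index is $75,000,000 / 140,160, or about $535 per ready-hour.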


5.2.5  R/M/T  DESIGN  TRADEOFF  ANALYSIS  CHECKLIST  QUESTIONS 

a.  Have  probabilistic  and  quantitative  readiness  goals  and  re¬ 
quirements  been  defined?  (PA) 

b.  Have  system  utilization,  on-station  demand,  and  critical 
turn-around  requirements  been  quantified?  (PA) 

c.  Have  the  following  factors  been  considered  in  developing  spe¬ 
cific  operational,  maintenance  and  support  requirements? 

• Facility needs
• Manpower constraints and loading
• Maintenance state of the art
• System support concept
• Levels of repair
• Provisioning and stock-out levels
• Special support equipment and diagnostic test architecture
• Maintenance publication and training
• Special and readiness inspections

d.  Are  logistic  support  cost,  LCC,  level  of  maintenance,  and 
mission  simulation  model  requirements  defined  in  support  of 
system  readiness  trades  and  effectiveness  analysis? 

e.  Has  provision  been  made  to  use  results  of  readiness  analysis 
for: 

• Support of design trades?
• Optimization of support systems?
• Progress towards meeting system demand requirements?
• Identifying readiness risks?

f.  Are  reliability  and  maintainability  quantitative  requirements 
adequately  defined  at  all  levels  (system,  subsystem,  compo¬ 
nent,  etc.)  to  ensure  necessary  quantitative  readiness  assess¬ 
ments? 

g.  Are  warranties  and/or  contractor  maintenance  factors  con¬ 
tained  in  the  readiness  equations?  If  so,  how  are  they  to  be 
implemented? 




6  -  ACRONYMS 

A/D          Analog to Digital
AF           Air Force
ARIES        Automated Reliability Interactive Estimation System
             (Computer Program)
ATE          Automatic Test Equipment
AWACS        Airborne Warning And Control System

BIT          Built-In Test
BITE         Built-In Test Equipment

C            Contractor
CARE         Computer Aided Reliability Estimation (Computer Program)
C3I          Command, Control, Communications and Intelligence
CDR          Critical Design Review
CDRL         Contract Data Requirements List
CE           Concept Exploration (Phase)
COTS         Commercial Off-the-Shelf
CPU          Central Processing Unit

D            Depot
DOD          Department of Defense
DSP          Defense Support Program

ECP          Engineering Change Proposal
ESS          Environmental Stress Screening

FD           Fault Detection
FFI          Fraction of Faults Isolatable
FI           Fault Isolation
FMEA         Failure Mode Effects Analysis
FMECA        Failure Mode, Effects and Criticality Analysis
FRACAS       Failure Reporting, Analysis and Corrective Action System
FSD          Full Scale Development (Phase)
FSED         Full Scale Engineering Development (Phase)

GSE          Ground Support Equipment

HARP         Hybrid Automated Reliability Predictor (Computer Program)

I            Intermediate
ILS          Integrated Logistic Support
IMU          Inertial Measurement Unit
IR           Infrared

Joint STARS  Joint Surveillance Target Attack Radar System

LCC          Life-Cycle Cost
LRM          Line Replaceable Module
LRU          Line Replaceable Unit
LSA          Logistic Support Analysis

MRT          Mean Restore Time
MTBCF        Mission-Time-Between-Critical-Failure
MTBF         Mean-Time-Between-Failure
MTBMA        Mean-Time-Between-Maintenance-Action
MTBMI        Mean-Time-Between-Maintenance-Inherent
MTTR         Mean-Time-To-Repair

NMR          N-Modular Redundancy

O            Organizational
O&S          Operating and Support

PA           Procuring Activity
PDR          Preliminary Design Review
PM           Preventive Maintenance
PROD         Production and Deployment (Phase)

R&M          Reliability and Maintainability
RAM          Random Access Memory
RF           Radio Frequency
RFP          Request For Proposal
RIW          Reliability Improvement Warranty
R/M/T        Reliability, Maintainability, Testability
RQT          Reliability Qualification Test

SOW          Statement of Work
SRR          System Requirements Review
SRU          Shop Replaceable Unit
ST           Self-Test
SURE         Semi-Markov Unreliability Range Evaluator
             (Computer Program)

TMR          Triple Modular Redundancy
TPS          Test Program Set

VALID        Demonstration and Validation (Phase)
VHSIC        Very High Speed Integrated Circuit
VLSI         Very Large Scale Integration


APPENDIX  A 


GLOSSARY  OF  RELIABILITY,  MAINTAINABILITY,  TESTABILITY 
AND  FAULT  TOLERANCE  TERMS 

AVAILABILITY: A measure of the degree to which an item is in an
operable and committable state at the start of a mission when the
mission is called for at an unknown (random) time. (Item state at
start of a mission includes the combined effects of readiness-related
system reliability and maintainability parameters, but excludes
mission time.) (1)
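
As an illustration only, a steady-state form often used for this measure is uptime divided by total time, which for an item described by a mean time between failures and a mean restore time reduces to MTBF / (MTBF + restore time). The sketch below assumes that form. The function name and the numbers are hypothetical and are not taken from this guide.

```python
def steady_state_availability(mtbf_hours: float, mean_restore_hours: float) -> float:
    """Fraction of time the item is operable, assuming alternating up and down periods."""
    if mtbf_hours <= 0 or mean_restore_hours < 0:
        raise ValueError("MTBF must be positive and restore time non-negative")
    return mtbf_hours / (mtbf_hours + mean_restore_hours)


# Hypothetical item: 500 h between failures, 2 h to restore.
availability = steady_state_availability(500.0, 2.0)
print(f"{availability:.4f}")  # prints 0.9960
```

Mission time does not appear in the calculation, consistent with the definition above, which excludes it.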

COVERAGE,  FAULT  PROTECTION:  The  conditional  probability  that  the 
system  will  recover  should  a  fault  occur. 

The  specification  of  the  types  of  errors  against  which  a  particular 
redundancy  scheme  guards.  (2) 
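
The first sense of coverage, a conditional probability, can be made concrete with a minimal sketch. It assumes coverage is estimated as the fraction of injected faults from which the system recovered, and it uses a simple two-unit model in which the system survives if both units work or if one unit fails and that fault is covered. The model, the function names, and the figures are illustrative assumptions, not a method prescribed by this guide.

```python
def estimate_coverage(recovered: int, injected: int) -> float:
    """Estimate P(recovery | fault) from fault-injection counts."""
    if injected <= 0 or not 0 <= recovered <= injected:
        raise ValueError("need injected > 0 and 0 <= recovered <= injected")
    return recovered / injected


def duplex_reliability(unit_reliability: float, coverage: float) -> float:
    """Two identical units: both survive, or exactly one fails and the fault is covered."""
    r = unit_reliability
    return r * r + 2.0 * coverage * r * (1.0 - r)


# Hypothetical test campaign: 970 recoveries out of 1000 injected faults.
c = estimate_coverage(recovered=970, injected=1000)
for cov in (1.0, c, 0.90):
    print(f"coverage={cov:.2f}  system reliability={duplex_reliability(0.95, cov):.4f}")
```

With perfect coverage the expression reduces to the familiar 1 - (1 - R)^2. Any shortfall in coverage removes part of the benefit of the second unit.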

DEPENDABILITY: A measure of the degree to which an item is operable
and capable of performing its required function at any (random)
time during a specified mission profile, given item availability at the
start of the mission. (Item state during a mission includes the
combined effects of reliability and maintainability parameters but
excludes non-mission time.) (1)

ERROR:  An  undesired  resource  state  that  exists  either  at  the  boundary 
or  at  an  internal  point  in  the  resource  and  may  be  perceived  as  a 
failure  when  it  is  propagated  to  and  manifested  at  the  boundary. 
(3) 




FAILURE: The event, or inoperable state, in which any item or part of
an item does not, or would not, perform as previously specified.
(1). A loss of service that is perceived by the user at the boundary
of the resource. (3)

FAULT: The immediate cause of failure (e.g., maladjustment,
misalignment, defect, etc.) (1). The identified or hypothesized cause
of the error or failure. (3). A fault may be latent and undetected
until it propagates and causes an error or functional failure at a
higher level of operation.

FAULT, DESIGN: A generic fault designed into a function, including
hardware and software faults and faults of other logical entities,
such as data bus interfaces.

FAULT DETECTION: The process of determining that an error caused
by a fault has occurred within the system. An undiscovered fault
is classified as a latent fault.

FAULT, INTERMITTENT: Hardware faults which result in recurring
inconsistent functional behavior of the hardware followed by recovery
of its ability to perform within specified limits without any
remedial action. Intermittent faults cannot occur in software or
logic.

FAULT  ISOLATION:  The  process  of  determining  the  location  of  a  fault 
to  the  extent  necessary  to  effect  repair,  correction,  or  restoration 
to  specified  performance.  (1) 

FAULT,  LATENT:  A  fault  which  exists  but  has  not  been  detected. 

FAULT, PERMANENT: A fault which, once it occurs, is irreversible
except for permanent removal from the system.


FAULT RECOVERY: The ability of the system to provide the required
service or performance or to correct errors after a fault has been
detected.

FAULT, TRANSIENT: A fault not caused by a permanent defect but
rather one which manifests a faulty behavior for some finite time
and then is fault free. A permanent or intermittent fault which
only occasionally produces discrepant results is not a transient
fault.

FAULT TOLERANCE: A survivable attribute of a system that allows it
to deliver its expected service after faults have manifested themselves
within the system. (3)

FAULT  TOLERANT  SYSTEM:  A  system  that  has  provisions  to  avoid 
failure  after  faults  have  caused  errors  within  the  system.  (3) 

ITEM: A generic term which may represent a system, subsystem,
equipment, assembly, subassembly, etc. depending on its designation in
each task. (4)

MAINTAINABILITY: The measure of the ability of an item to be retained
in or restored to specified condition when maintenance is performed
by personnel having specified skill levels, using prescribed procedures
and resources, at each prescribed level of maintenance and
repair. (1)

MEAN-TIME-BETWEEN-FAILURE (MTBF): A basic measure of the system
reliability parameter related to availability and readiness. The total
number of system life units, divided by the total number of events
in which the system becomes unavailable to initiate its mission(s),
during a stated period of time. (1)

MISSION-TIME-BETWEEN-CRITICAL-FAILURES (MTBCF): A measure of
MISSION RELIABILITY: The total amount of mission time divided
by the total number of critical failures during a stated series of
missions. (1)
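
The MTBF and MTBCF definitions are both simple ratios, and a short sketch may help separate what each one counts. The record layout, the function names, and the data below are hypothetical. The handling of a zero event count (returning infinity) is an assumption made here for convenience, not a convention stated in the cited standard.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class MissionRecord:
    mission_hours: float      # mission time accumulated on this mission
    critical_failures: int    # failures that were mission-critical


def mtbf(total_life_units: float, downing_events: int) -> float:
    """Life units divided by events that left the system unable to start a mission."""
    if downing_events == 0:
        return float("inf")
    return total_life_units / downing_events


def mtbcf(missions: Iterable[MissionRecord]) -> float:
    """Total mission time divided by total critical failures over the series."""
    records = list(missions)
    total_time = sum(m.mission_hours for m in records)
    total_critical = sum(m.critical_failures for m in records)
    if total_critical == 0:
        return float("inf")
    return total_time / total_critical


# Hypothetical data: 1200 life units (hours) with 8 downing events,
# and a series of four missions with two critical failures.
series = [
    MissionRecord(4.0, 0),
    MissionRecord(6.0, 1),
    MissionRecord(5.0, 0),
    MissionRecord(5.0, 1),
]
print(f"MTBF  = {mtbf(1200.0, 8):.1f}")   # prints MTBF  = 150.0
print(f"MTBCF = {mtbcf(series):.1f}")     # prints MTBCF = 10.0
```

The two measures use different numerators (system life units versus mission time) and different event counts, so they are not directly comparable without stating both bases.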

OPERABLE: The state of being able to perform the intended function.
(1)

REDUNDANCY:  The  existence  of  more  than  one  means  of  accomplishing 
a  given  function.  Each  means  of  accomplishing  the  function  need 
not  necessarily  be  identical.  (1) 

REDUNDANCY,  ACTIVE:  The  redundancy  wherein  all  redundant  items 
are  operating  simultaneously.  (1) 

REDUNDANCY, STANDBY: That redundancy wherein the alternative
means of performing the function is not operating until it is activated
upon failure of the primary means of performing the function.
(1)
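
The difference between active and standby redundancy can be illustrated numerically. The sketch below is a textbook comparison under stated assumptions that are not part of the definitions above: two identical units, a constant failure rate, perfect failure detection and switching, and a standby unit that cannot fail while dormant. Function names and values are hypothetical.

```python
import math


def single_unit_reliability(failure_rate: float, hours: float) -> float:
    """Exponential reliability of one unit over the interval."""
    return math.exp(-failure_rate * hours)


def active_pair_reliability(failure_rate: float, hours: float) -> float:
    """Both units operate from the start; the function is lost only if both fail."""
    r = single_unit_reliability(failure_rate, hours)
    return 1.0 - (1.0 - r) ** 2


def standby_pair_reliability(failure_rate: float, hours: float) -> float:
    """Spare is switched in on primary failure (ideal switch, no dormant failures)."""
    x = failure_rate * hours
    return math.exp(-x) * (1.0 + x)


rate, mission = 0.001, 1000.0  # hypothetical: 1 failure per 1000 h, 1000 h interval
print(f"single : {single_unit_reliability(rate, mission):.4f}")
print(f"active : {active_pair_reliability(rate, mission):.4f}")
print(f"standby: {standby_pair_reliability(rate, mission):.4f}")
```

Under these idealized assumptions the standby arrangement scores higher because the spare accumulates no operating time until it is needed. Imperfect switching or fault detection would reduce that advantage.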

RELIABILITY: (a) The duration or probability of failure-free
performance under stated conditions. (1)

(b) The probability that an item can perform its intended function
for a specified interval under stated conditions. (For non-redundant
items this is equivalent to definition (a). For redundant
items this is equivalent to the definition of mission reliability.) (1)

RELIABILITY, MISSION: The ability of an item to perform its required
functions for the duration of the specified mission profile. (1)


TESTABILITY: A design characteristic which allows the status
(operable, inoperable, or degraded) of an item to be determined
and the isolation of faults within the item to be performed in a
timely manner. (4)

NOTE: The sources of key definitions are given in parentheses following
the definition. The source identification codes are:

(1) MIL-STD-721C, "Definition of Terms for Reliability and
Maintainability."

(2) D. P. Siewiorek, R. S. Swarz, "The Theory and Practice of
Reliable System Design", Digital Press, 1982.

(3) A. Avizienis, J. C. Laprie, "Dependable Computing: From
Concepts to Design Diversity", Proceedings of the IEEE, Vol.
74, No. 5, May 1986.

(4) MIL-STD-2165, "Testability Program for Electronic Systems and
Equipment."




APPENDIX  B 


REFERENCES 

& 

LIST  OF  GOVERNMENT  DOCUMENTS 



(1) Caroli, J. A., et al., "Reliability Demonstration Technique for Fault
Tolerant Systems". Proceedings Annual Reliability and Maintainability
Symposium, 1987 January, 316-320.

(2) Musson, T. A., "System R&M Parameters from DoD Directive
5000.40." Proceedings Annual Reliability and Maintainability
Symposium, 1981.

(3)  National  Security  Industrial  Association,  Integrated  Diagnostics 
Group.  "Guidelines  for  Preparation  of  Diagnostic  Requirements", 
July  21,  1986. 

LIST  OF  GOVERNMENT  DOCUMENTS 

DoD Directive 5000.40    Reliability and Maintainability

DoD-STD-2167             Defense System Software Development

MIL-HDBK-338             Electronic Reliability Design Handbook

MIL-STD-470A             Maintainability Program for Systems and
                         Equipment

MIL-STD-471A             Maintainability Verification/Demonstration/
                         Evaluation

MIL-STD-756B             Reliability Modeling and Prediction

MIL-STD-781C             Reliability Design Qualification and
                         Production Acceptance Tests: Exponential
                         Distribution

MIL-STD-785B             Reliability Program for Systems and
                         Equipment Development and Production

MIL-STD-882B             System Safety Program Requirements

MIL-STD-1388/1A          Logistics Support Analysis

MIL-STD-1521B            Technical Reviews and Audits for Systems,
                         Equipment, and Computer Software

MIL-STD-1574A            System Safety Program for Space and
                         Missile Systems

MIL-STD-2165             Testability Program for Electronic Systems
                         and Equipment

NHB 1700.7A              Safety Policy and Requirements (NASA
                         Publication)




